Initial v1alpha2 MPIJob API Spec #95

terrytangyuan · 2019-03-05T23:31:28Z

This is an initial attempt at the v1alpha2 MPIJob API spec. See the discussion in #92 for the reasoning behind some of the decisions.

Some main differences from v1alpha1 are that v1alpha2:

Removes deprecated GPU-specific fields, namely GPUs and GPUsPerNode. This is the remaining work from "Support processing resource types other than GPU" (#75) and "Move processing unit specific flags to MPIJobSpec" (#85).
Replaces Template with MPIReplicaSpecs, which uses the common ReplicaSpec, similar to what's done in pytorch-operator and tf-operator. See "separate out worker and launcher pod specs" (#54) and "Launcher and worker statuses do not correctly indicate the underlying states" (#90).
Replaces MPIJobLauncherStatusType with the common JobStatus, similar to pytorch-operator and tf-operator.
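To make the second difference concrete, here is a minimal sketch of what a spec keyed by replica type could look like. It is not the code in types.go: PodTemplate is a stand-in for the Kubernetes pod template type, the two replica type names and the TotalReplicas helper are hypothetical, and treating an unset replica count as 1 is an assumption made only for this illustration.

```go
package main

import "fmt"

// ReplicaType names a role within the job. Both values are assumptions
// made for this sketch.
type ReplicaType string

const (
	ReplicaTypeLauncher ReplicaType = "Launcher"
	ReplicaTypeWorker   ReplicaType = "Worker"
)

// PodTemplate is a stand-in for the Kubernetes pod template type so the
// sketch compiles without external dependencies.
type PodTemplate struct {
	Image string
}

// ReplicaSpec is a simplified stand-in for the common ReplicaSpec.
type ReplicaSpec struct {
	Replicas *int32
	Template PodTemplate
}

// MPIJobSpec sketches the v1alpha2 shape: one ReplicaSpec per role instead
// of a single Template plus GPU-specific fields.
type MPIJobSpec struct {
	MPIReplicaSpecs map[ReplicaType]*ReplicaSpec
}

// TotalReplicas (hypothetical helper) sums the replica counts across all
// roles. A nil count is treated as 1, which is an assumption of this sketch.
func TotalReplicas(spec MPIJobSpec) int32 {
	var total int32
	for _, rs := range spec.MPIReplicaSpecs {
		if rs == nil {
			continue
		}
		if rs.Replicas == nil {
			total++
			continue
		}
		total += *rs.Replicas
	}
	return total
}

func main() {
	workers := int32(4)
	spec := MPIJobSpec{
		MPIReplicaSpecs: map[ReplicaType]*ReplicaSpec{
			ReplicaTypeLauncher: {Template: PodTemplate{Image: "example/launcher"}},
			ReplicaTypeWorker:   {Replicas: &workers, Template: PodTemplate{Image: "example/worker"}},
		},
	}
	fmt.Println("total replicas:", TotalReplicas(spec))
}
```

Keeping one pod template per role is what lets the launcher and the workers carry different images and resource requests.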

Note that pkg/apis/kubeflow/v1alpha2/common_types.go is copied directly from tf-operator since I don't believe mpi-operator should depend on tf-operator. We can switch to use a common repo once it's ready. We should continue the discussion.

I also pushed the auto-generated v1alpha2 client code. It would be easier for you to review if you focus on changes in pkg/apis/kubeflow/v1alpha2/types.go.

cc: @rongou @anfeng @jlewi @everpeace @gaocegege @Nivedita-V @madhukarkm @ywskycn @ScorpioCPH @jian-he @cheyang @richardsliu @johnugeorge @Jeffwan


johnugeorge · 2019-03-06T01:30:45Z

We had discussed this earlier. Do you want to keep a copy of common_types.go separate from the one used for TF/PyTorch? I agree that the current location is not ideal and that it has to be kept out of the operator repos.

rongou · 2019-03-06T17:44:04Z

I moved @terrytangyuan and @madhukarkm to approvers.

@madhukarkm, can you review this to make sure it works for our internal needs?

madhukarkm · 2019-03-06T23:01:46Z

Sure, Rong. @terrytangyuan: can you please add me and @Nivedita-V as reviewers for this PR?

Regarding the common status types, I agree that refactoring to move common_types out of tf-operator would be best. Until then, it seems better to have just one copy in tf-operator and depend on it rather than keep two copies.

richardsliu · 2019-03-06T23:54:04Z

I would agree that keeping one copy for common types is better than duplicating and possibly diverging the two APIs.

pkg/apis/kubeflow/v1alpha2/types.go

terrytangyuan · 2019-03-07T06:56:08Z

/cc @madhukarkm @Nivedita-V

k8s-ci-robot · 2019-03-07T06:56:09Z

@terrytangyuan: GitHub didn't allow me to request PR reviews from the following users: madhukarkm, Nivedita-V.

Note that only kubeflow members and repo collaborators can review this PR, and authors cannot review their own PRs.

In response to this:

/cc @madhukarkm @Nivedita-V

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

terrytangyuan · 2019-03-07T07:06:13Z

@madhukarkm @Nivedita-V I was not able to assign you as reviewers for this PR because you are not yet members of the kubeflow GitHub org. Please go ahead and review it when you get a chance, to see whether this new API spec works for your internal needs.

@johnugeorge @richardsliu @madhukarkm Agreed that we should avoid duplicating the common types and risking divergence. I'll switch to the copy in tf-operator in a follow-up PR. That keeps this PR minimal for reviewing the new MPIJob API spec, since a vendor update would be very messy and hard to review.


johnugeorge · 2019-03-08T03:36:12Z

Sounds good

terrytangyuan · 2019-03-11T14:37:05Z

I've addressed all the comments. Please take a look again when you get a chance.

madhukarkm

Thanks @terrytangyuan. A few comments, especially regarding the ProcessingSpec.

madhukarkm · 2019-03-12T21:45:42Z

pkg/apis/kubeflow/v1alpha2/common_types.go

+// ReplicaStatus represents the current observed state of the replica.
+type ReplicaStatus struct {
+	// The number of actively running pods.
+	Active int32 `json:"active,omitempty"`


Can we add a Pending replica status, besides Active, Succeeded, and Failed?

This may be useful for tf-operator as well, especially with coscheduling.

terrytangyuan · 2019-03-13

Sure, that should probably be added once we have a common repo. We can focus on MPIJob's spec for this PR.
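For reference, the Pending suggestion could look roughly like the sketch below. The Pending field is hypothetical and was not added in this PR. PodPhase is a local stand-in for the Kubernetes pod phase type, and Summarize is an illustrative helper, not part of any operator.

```go
package main

import "fmt"

// PodPhase is a stand-in for the Kubernetes pod phase type.
type PodPhase string

const (
	PodPending   PodPhase = "Pending"
	PodRunning   PodPhase = "Running"
	PodSucceeded PodPhase = "Succeeded"
	PodFailed    PodPhase = "Failed"
)

// ReplicaStatus mirrors the struct quoted above, with a hypothetical
// Pending counter added next to Active, Succeeded, and Failed.
type ReplicaStatus struct {
	// Hypothetical: the number of pods that are not yet running.
	Pending int32 `json:"pending,omitempty"`
	// The number of actively running pods.
	Active    int32 `json:"active,omitempty"`
	Succeeded int32 `json:"succeeded,omitempty"`
	Failed    int32 `json:"failed,omitempty"`
}

// Summarize (illustrative helper) counts pods by phase.
func Summarize(phases []PodPhase) ReplicaStatus {
	var st ReplicaStatus
	for _, p := range phases {
		switch p {
		case PodPending:
			st.Pending++
		case PodRunning:
			st.Active++
		case PodSucceeded:
			st.Succeeded++
		case PodFailed:
			st.Failed++
		}
	}
	return st
}

func main() {
	st := Summarize([]PodPhase{PodPending, PodPending, PodRunning})
	fmt.Printf("%+v\n", st)
}
```

With coscheduling, a separate Pending count would let a client tell a job that is waiting for its whole gang to be scheduled apart from one that has no pods at all.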

madhukarkm · 2019-03-12T21:50:14Z

pkg/apis/kubeflow/v1alpha2/common_types.go

+// RestartPolicy describes how the replicas should be restarted.
+// Only one of the following restart policies may be specified.
+// If none of the following policies is specified, the default one
+// is RestartPolicyAlways.


L77 in ReplicaSpec says the default for RestartPolicy is Never. It would be better to make these consistent.

terrytangyuan · 2019-03-13

Let's focus on MPIJob's spec for this PR to avoid any divergent changes in common types. I'll keep these notes unresolved here to remind us to address them later.

madhukarkm · 2019-03-15

OK to handle these later.
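One way to avoid this kind of mismatch between two doc comments is to define the default once and reference it from the defaulting code. The sketch below assumes Never as the default purely for illustration; the thread does not settle which value is correct. DefaultRestartPolicy and SetDefaults are hypothetical names.

```go
package main

import "fmt"

// RestartPolicy describes how the replicas should be restarted.
type RestartPolicy string

const (
	RestartPolicyAlways    RestartPolicy = "Always"
	RestartPolicyOnFailure RestartPolicy = "OnFailure"
	RestartPolicyNever     RestartPolicy = "Never"
)

// DefaultRestartPolicy is the single place where the default is stated.
// Choosing Never here is an assumption made for this sketch.
const DefaultRestartPolicy = RestartPolicyNever

// ReplicaSpec is a simplified stand-in that keeps only the field under
// discussion.
type ReplicaSpec struct {
	RestartPolicy RestartPolicy
}

// SetDefaults (hypothetical) fills in the restart policy when it is unset
// and leaves an explicit choice untouched.
func SetDefaults(spec *ReplicaSpec) {
	if spec.RestartPolicy == "" {
		spec.RestartPolicy = DefaultRestartPolicy
	}
}

func main() {
	unset := ReplicaSpec{}
	SetDefaults(&unset)
	fmt.Println("defaulted to:", unset.RestartPolicy)
}
```

Doc comments on both RestartPolicy and ReplicaSpec can then point at the constant instead of each naming a value.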

pkg/apis/kubeflow/v1alpha2/types.go

madhukarkm · 2019-03-12T22:23:11Z

pkg/apis/kubeflow/v1alpha2/types.go

+	// Specifies the desired number of processing units the MPIJob should run on.
+	// Mutually exclusive with `ReplicaSpec.Replicas` in `MPIReplicaSpecs`.
+	// +optional
+	Units *int32 `json:"units,omitempty"`


Will these be converted to the replica pod spec? What about memory in the case of the CPU type?

terrytangyuan · 2019-03-13

Good question. This is not currently considered inside ProcessingSpec. We could add memory, disk, etc. to the spec. Another idea is to get rid of ProcessingSpec and make full use of MPIReplicaSpecs, which allows users to provide a specific pod spec. One piece of feedback I often receive from our users is that having both ProcessingSpec and MPIReplicaSpecs is confusing. @rongou What do you think the best approach would be?

rongou · 2019-03-15

I'm fine with getting rid of ProcessingSpec.

terrytangyuan · 2019-03-15

Sounds good. @madhukarkm @rongou I've just removed ProcessingSpec. Please take another look when you get a chance. This is just the initial API spec. I also added some TODOs that will be addressed later, once we have a common operator.
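For context on why keeping both was confusing: the quoted field comment made Units mutually exclusive with ReplicaSpec.Replicas, so the controller would have needed a check along the lines of the sketch below. ProcessingSpec was removed in this PR, so this is illustrative only. The types are reduced to the two fields involved, and Validate and ErrUnitsAndReplicas are hypothetical names.

```go
package main

import (
	"errors"
	"fmt"
)

// ProcessingSpec is reduced to the Units field from the quoted diff.
type ProcessingSpec struct {
	Units *int32
}

// ReplicaSpec is reduced to the Replicas field.
type ReplicaSpec struct {
	Replicas *int32
}

// ErrUnitsAndReplicas is a hypothetical error for the conflicting case.
var ErrUnitsAndReplicas = errors.New("units and replicas are mutually exclusive")

// Validate (hypothetical) rejects a spec that sets both Units and Replicas.
func Validate(p *ProcessingSpec, worker *ReplicaSpec) error {
	unitsSet := p != nil && p.Units != nil
	replicasSet := worker != nil && worker.Replicas != nil
	if unitsSet && replicasSet {
		return ErrUnitsAndReplicas
	}
	return nil
}

func main() {
	n := int32(8)
	err := Validate(&ProcessingSpec{Units: &n}, &ReplicaSpec{Replicas: &n})
	fmt.Println("both set:", err)
}
```

Dropping ProcessingSpec removes the need for this check, because the replica count and the per-pod resources then come from a single place.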

pkg/apis/kubeflow/v1alpha2/types.go

madhukarkm

Sounds good to remove ProcessingSpec.. looks like ResourceType can also be removed.

madhukarkm · 2019-03-15T20:51:49Z

pkg/apis/kubeflow/v1alpha2/common_types.go

+// RestartPolicy describes how the replicas should be restarted.
+// Only one of the following restart policies may be specified.
+// If none of the following policies is specified, the default one
+// is RestartPolicyAlways.


OK to handle these later.

pkg/apis/kubeflow/v1alpha2/types.go

terrytangyuan · 2019-03-16T01:19:08Z

Sounds good to remove ProcessingSpec.. looks like ResourceType can also be removed.

@madhukarkm Thanks. Fixed.

madhukarkm · 2019-03-16T02:10:29Z

/lgtm

Let's wait till tomorrow in case there are any other comments.

k8s-ci-robot · 2019-03-16T02:10:36Z

@madhukarkm: changing LGTM is restricted to assignees, and only kubeflow/mpi-operator repo collaborators may be assigned issues.

In response to this:

/lgtm

Let's wait till tomorrow in case there are any other comments.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

rongou · 2019-03-16T22:23:30Z

/lgtm
/approve

k8s-ci-robot · 2019-03-16T22:23:33Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: rongou

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [rongou]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Initial v1alpha2 MPIJob API Spec

71fec59

k8s-ci-robot requested review from everpeace and gaocegege March 5, 2019 23:31

k8s-ci-robot added the size/L label Mar 5, 2019

terrytangyuan mentioned this pull request Mar 5, 2019

MPI Operator v1alpha2 API Design Proposal #92

Closed

terrytangyuan added 2 commits March 5, 2019 15:54

Update zz_generated.deepcopy.go

48d48f6

Generate v1alpha2 client code

76844e3

k8s-ci-robot added size/XXL and removed size/L labels Mar 6, 2019

richardsliu reviewed Mar 7, 2019

pkg/apis/kubeflow/v1alpha2/types.go (outdated, resolved)

richardsliu reviewed Mar 7, 2019

pkg/apis/kubeflow/v1alpha2/types.go (outdated, resolved)

Added ProcessingSpec struct and removed unnecessary Replicas

de74043

Nivedita-V reviewed Mar 8, 2019

pkg/apis/kubeflow/v1alpha2/types.go (outdated, resolved)

pkg/apis/kubeflow/v1alpha2/types.go (outdated, resolved)

Add ResourceType const with corev1.ResourceName type and added note in UnitsPerNode

548e603

madhukarkm reviewed Mar 12, 2019


richardsliu mentioned this pull request Mar 15, 2019

Proposal for a Common Operator kubeflow/training-operator#960

Closed

Removed ProcessingSpec and added some TODOs

6da293c

madhukarkm reviewed Mar 15, 2019


Remove unused ResourceType

05f172c

k8s-ci-robot assigned rongou Mar 16, 2019

k8s-ci-robot added the lgtm label Mar 16, 2019

k8s-ci-robot added the approved label Mar 16, 2019

k8s-ci-robot merged commit b3c25ea into kubeflow:master Mar 16, 2019

terrytangyuan deleted the v1alpha2-spec branch March 16, 2019 22:32

terrytangyuan mentioned this pull request Mar 24, 2019

Add v1alpha2 MPIJob controller #98

Merged
