Initial v1alpha2 MPIJob API Spec #95
We had discussed this earlier. Do you want to keep a copy of common_types.go separate from the one used for TF/PyTorch? I agree that the current location is not ideal and that it has to be kept out of the operator repos. |
I moved @terrytangyuan and @madhukarkm to approvers. @madhukarkm, can you review this to make sure it works for our internal needs? |
Sure, Rong. @terrytangyuan: can you please add me and @Nivedita-V as reviewers for this PR? Regarding common status types, I agree that refactoring to move common_types out of tf-operator would be best. Until then, it seems better to have just one copy in tf-operator and a dependency on it rather than two copies. |
I would agree that keeping one copy for common types is better than duplicating and possibly diverging the two APIs. |
@terrytangyuan: GitHub didn't allow me to request PR reviews from the following users: madhukarkm, Nivedita-V. Note that only kubeflow members and repo collaborators can review this PR, and authors cannot review their own PRs. |
@madhukarkm @Nivedita-V I was not able to assign you to review this PR for now since you are not members of the kubeflow GitHub org yet. Please go ahead and review this when you get a chance to see if this new API spec works for your internal needs. @johnugeorge @richardsliu @madhukarkm Agreed that we should avoid duplicating common types and possible divergence. I'll switch to use the one in |
Sounds good |
I've addressed all the comments. Please take a look again when you get a chance. |
Thanks @terrytangyuan. A few comments, especially regarding the ProcessingSpec.
// ReplicaStatus represents the current observed state of the replica.
type ReplicaStatus struct {
	// The number of actively running pods.
	Active int32 `json:"active,omitempty"`
Can we add a Pending replica status, besides Active, Succeeded, and Failed?
It may be useful for tf-operator as well, especially with coscheduling.
Sure that should probably be added once we have a common repo. We can focus on MPIJob's spec for this PR.
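To make the suggestion above concrete, here is a minimal sketch of what a `Pending` counter could look like next to the existing ones. It is not part of this PR: the `Pending` field, the JSON tags for `Succeeded` and `Failed`, and the `summarize` helper are all assumptions for illustration, and the phase strings are assumed to follow the usual Kubernetes pod phases.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// ReplicaStatus mirrors the struct under review, plus a hypothetical
// Pending counter that this PR does not add.
type ReplicaStatus struct {
	Active    int32 `json:"active,omitempty"`
	Succeeded int32 `json:"succeeded,omitempty"` // tag assumed
	Failed    int32 `json:"failed,omitempty"`    // tag assumed
	Pending   int32 `json:"pending,omitempty"`   // hypothetical field
}

// summarize is a hypothetical helper that tallies pod phases into a
// ReplicaStatus. Unknown phases are ignored.
func summarize(phases []string) ReplicaStatus {
	var s ReplicaStatus
	for _, p := range phases {
		switch p {
		case "Pending":
			s.Pending++
		case "Running":
			s.Active++
		case "Succeeded":
			s.Succeeded++
		case "Failed":
			s.Failed++
		}
	}
	return s
}

func main() {
	// Two workers still waiting to be scheduled, one already running.
	s := summarize([]string{"Pending", "Pending", "Running"})
	out, err := json.Marshal(s)
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out))
}
```

With coscheduling, a non-zero pending count would let a user tell "waiting for the whole gang to be schedulable" apart from "nothing has been created yet", which the three existing counters alone cannot express.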
// RestartPolicy describes how the replicas should be restarted.
// Only one of the following restart policies may be specified.
// If none of the following policies is specified, the default one
// is RestartPolicyAlways.
L77 in ReplicaSpec says default for RestartPolicy is Never. Better to make these consistent.
Let's focus on MPIJob's spec for this PR to avoid any divergent changes in common types. I'll keep these notes unresolved here to remind us to address them later.
OK to handle these later.
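One way to avoid the kind of mismatch flagged in this thread is to define the default in exactly one place and have both the doc comments and the defaulting code refer to it. The sketch below assumes simplified stand-in types; `DefaultRestartPolicy` and `setDefaults` are hypothetical names, and the choice of `Never` is only for illustration, since the thread leaves the correct default open.

```go
package main

import "fmt"

// RestartPolicy is a simplified stand-in for the common type.
type RestartPolicy string

const (
	RestartPolicyAlways RestartPolicy = "Always"
	RestartPolicyNever  RestartPolicy = "Never"
)

// DefaultRestartPolicy is a hypothetical single source of truth.
// Never is chosen here only as an example value.
const DefaultRestartPolicy = RestartPolicyNever

// ReplicaSpec is reduced to the one field relevant to this sketch.
type ReplicaSpec struct {
	RestartPolicy RestartPolicy
}

// setDefaults fills in the restart policy only when the user left it empty.
func setDefaults(spec *ReplicaSpec) {
	if spec.RestartPolicy == "" {
		spec.RestartPolicy = DefaultRestartPolicy
	}
}

func main() {
	unset := ReplicaSpec{}
	setDefaults(&unset)
	fmt.Println("unset ->", unset.RestartPolicy)

	explicit := ReplicaSpec{RestartPolicy: RestartPolicyAlways}
	setDefaults(&explicit)
	fmt.Println("explicit ->", explicit.RestartPolicy)
}
```

Comments on the type and on the field can then say "defaults to `DefaultRestartPolicy`" instead of naming a value, so they cannot drift apart.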
pkg/apis/kubeflow/v1alpha2/types.go
Outdated
// Specifies the desired number of processing units the MPIJob should run on.
// Mutually exclusive with `ReplicaSpec.Replicas` in `MPIReplicaSpecs`.
// +optional
Units *int32 `json:"units,omitempty"`
Will these be converted to the replica pod spec? What about memory in the case of the CPU type?
Good question. This is not currently considered inside `ProcessingSpec`. We could add memory, disk, etc. to the spec. Another idea is that we could get rid of `ProcessingSpec` and make full use of `MPIReplicaSpecs`, which will allow users to provide a specific pod spec. One piece of feedback I often receive from our users is that having both `ProcessingSpec` and `MPIReplicaSpecs` is confusing. @rongou What do you think the best approach would be?
I'm fine with getting rid of `ProcessingSpec`.
Sounds good. @madhukarkm @rongou I've just removed ProcessingSpec. Please take another look when you get a chance. This is just the initial API spec. I also added some TODOs that will be addressed later once we have a common operator.
Sounds good to remove ProcessingSpec. It looks like ResourceType can also be removed.
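To illustrate where the thread above ends up, here is a rough sketch of a job spec that relies only on per-replica specs, so that memory and other resources are expressed per replica type rather than through a separate processing spec. The types are simplified stand-ins and not the code in this PR: the replica type constants, the `Resources` map (standing in for resource requests inside a real pod template), and the `totalReplicas` helper are hypothetical, and treating a nil `Replicas` as 1 is an assumption.

```go
package main

import "fmt"

// MPIReplicaType names a replica role; the constants are assumed names.
type MPIReplicaType string

const (
	MPIReplicaTypeLauncher MPIReplicaType = "Launcher"
	MPIReplicaTypeWorker   MPIReplicaType = "Worker"
)

// ReplicaSpec is a simplified stand-in. A real spec would carry a pod
// template; Resources here only imitates its resource requests.
type ReplicaSpec struct {
	Replicas  *int32
	Resources map[string]string
}

// MPIJobSpec keeps one ReplicaSpec per replica type and nothing else.
type MPIJobSpec struct {
	MPIReplicaSpecs map[MPIReplicaType]*ReplicaSpec
}

// totalReplicas sums replicas across all types.
// Assumption: a nil Replicas pointer counts as one replica.
func totalReplicas(spec MPIJobSpec) int32 {
	var total int32
	for _, rs := range spec.MPIReplicaSpecs {
		if rs == nil {
			continue
		}
		if rs.Replicas == nil {
			total++
			continue
		}
		total += *rs.Replicas
	}
	return total
}

func main() {
	workers := int32(4)
	spec := MPIJobSpec{
		MPIReplicaSpecs: map[MPIReplicaType]*ReplicaSpec{
			MPIReplicaTypeLauncher: {}, // Replicas left unset
			MPIReplicaTypeWorker: {
				Replicas:  &workers,
				Resources: map[string]string{"cpu": "8", "memory": "16Gi"},
			},
		},
	}
	fmt.Println("total replicas:", totalReplicas(spec))
	fmt.Println("worker memory:", spec.MPIReplicaSpecs[MPIReplicaTypeWorker].Resources["memory"])
}
```

The trade-off is that users write more per-replica detail, but there is only one place to look for replica counts and resources.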
@madhukarkm Thanks. Fixed. |
/lgtm Let's wait till tomorrow in case there are any other comments. |
@madhukarkm: changing LGTM is restricted to assignees, and only kubeflow/mpi-operator repo collaborators may be assigned issues. |
/lgtm |
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: rongou |
This is the initial attempt at the v1alpha2 MPIJob API spec. See the discussion in #92 for some of the decisions.

Some main differences from `v1alpha1` are that `v1alpha2`:

- [...] `GPUs` and `GPUsPerNode`. This is the remaining work from "Support processing resource types other than GPU" (#75) and "Move processing unit specific flags to MPIJobSpec" (#85).
- Replaces `Template` with `MPIReplicaSpecs` that uses the common `ReplicaSpec`, similar to what's done in `pytorch-operator` and `tf-operator`. See "separate out worker and launcher pod specs" (#54) and "Launcher and worker statuses do not correctly indicate the underlying states" (#90).
- Replaces `MPIJobLauncherStatusType` with the common `JobStatus`, similar to `pytorch-operator` and `tf-operator`.

Note that `pkg/apis/kubeflow/v1alpha2/common_types.go` is copied directly from `tf-operator` since I don't believe `mpi-operator` should depend on `tf-operator`. We can switch to use a common repo once it's ready. We should continue the discussion.

I also pushed the auto-generated v1alpha2 client code. It would be easier for you to review if you focus on changes in `pkg/apis/kubeflow/v1alpha2/types.go`.

cc: @rongou @anfeng @jlewi @everpeace @gaocegege @Nivedita-V @madhukarkm @ywskycn @ScorpioCPH @jian-he @cheyang @richardsliu @johnugeorge @Jeffwan
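As a rough illustration of the move from a single launcher status value to a common job status, the sketch below models a status as a list of conditions plus per-replica counters. It uses simplified stand-in types rather than the contents of `common_types.go`: the condition names, the boolean `Status` field (a real API would likely use a tri-state value), and the `hasCondition` helper are all assumptions.

```go
package main

import "fmt"

// JobConditionType names a job-level condition; the values are assumed.
type JobConditionType string

const (
	JobRunning   JobConditionType = "Running"
	JobSucceeded JobConditionType = "Succeeded"
	JobFailed    JobConditionType = "Failed"
)

// JobCondition is a simplified stand-in with a boolean status.
type JobCondition struct {
	Type   JobConditionType
	Status bool
}

// ReplicaStatus holds per-replica-type pod counters.
type ReplicaStatus struct {
	Active    int32
	Succeeded int32
	Failed    int32
}

// JobStatus combines job-level conditions with per-replica counters,
// keyed here by a plain string replica type name.
type JobStatus struct {
	Conditions      []JobCondition
	ReplicaStatuses map[string]*ReplicaStatus
}

// hasCondition reports whether the given condition is present and true.
func hasCondition(status JobStatus, t JobConditionType) bool {
	for _, c := range status.Conditions {
		if c.Type == t && c.Status {
			return true
		}
	}
	return false
}

func main() {
	status := JobStatus{
		Conditions: []JobCondition{{Type: JobRunning, Status: true}},
		ReplicaStatuses: map[string]*ReplicaStatus{
			"Launcher": {Active: 1},
			"Worker":   {Active: 4},
		},
	}
	fmt.Println("running:", hasCondition(status, JobRunning))
	fmt.Println("succeeded:", hasCondition(status, JobSucceeded))
	fmt.Println("active workers:", status.ReplicaStatuses["Worker"].Active)
}
```

Keeping launcher and worker counters side by side is what lets the status reflect the state of both roles instead of the launcher alone.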