Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Initial v1alpha2 MPIJob API Spec #95

Merged
merged 7 commits into from
Mar 16, 2019
Merged

Initial v1alpha2 MPIJob API Spec #95

merged 7 commits into from
Mar 16, 2019

Conversation

terrytangyuan
Copy link
Member

@terrytangyuan terrytangyuan commented Mar 5, 2019

This is the initial attempt for v1alpha2 MPIJob API Spec. See discussion in #92 for some of the decisions.

Some main differences from v1alpha1 are that v1alpha2:

Note that pkg/apis/kubeflow/v1alpha2/common_types.go is copied directly from tf-operator since I don't believe mpi-operator should depend on tf-operator. We can switch to use a common repo once it's ready. We should continue the discussion.

I also pushed the auto-generated v1alpha2 client code. It would be easier for you to review if you focus on changes in pkg/apis/kubeflow/v1alpha2/types.go.

cc: @rongou @anfeng @jlewi @everpeace @gaocegege @Nivedita-V @madhukarkm @ywskycn @ScorpioCPH @jian-he @cheyang @richardsliu @johnugeorge @Jeffwan


This change is Reviewable

@johnugeorge
Copy link
Member

We had discussed this earlier. Do you want to keep copy of Common_types.go separately from the one used for TF/PyTorch ? i agree that the current location is not ideal and it has to kept out of the operator repos.

@rongou
Copy link
Member

rongou commented Mar 6, 2019

I moved @terrytangyuan and @madhukarkm to approvers.

@madhukarkm, can you review this to make sure it works for our internal needs?

@madhukarkm
Copy link

Sure Rong. @terrytangyuan: can you please add myself and @Nivedita-V as reviewers for this PR.

Reg common status types, agree that refactoring to move common_types out of tf-operator would be best. Till then, seems better to have just one copy in tf-operator and dependency on it rather than two copies.

@richardsliu
Copy link

I would agree that keeping one copy for common types is better than duplicating and possibly diverging the two APIs.

@terrytangyuan
Copy link
Member Author

/cc @madhukarkm @Nivedita-V

@k8s-ci-robot
Copy link

@terrytangyuan: GitHub didn't allow me to request PR reviews from the following users: madhukarkm, Nivedita-V.

Note that only kubeflow members and repo collaborators can review this PR, and authors cannot review their own PRs.

In response to this:

/cc @madhukarkm @Nivedita-V

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@terrytangyuan
Copy link
Member Author

terrytangyuan commented Mar 7, 2019

@madhukarkm @Nivedita-V I was not able to assign you to review this PR for now since you are not member of kubeflow Github org yet. Please go ahead and review this when you get a chance to see if this new API spec works for your internal needs.

@johnugeorge @richardsliu @madhukarkm Agreed that we should avoid duplicating common types and possible divergence. I'll switch to use the one in tf-operator in a follow-up PR to keep this PR minimal for reviewing the new MPIJob API spec since vendor update would be very messy and hard to review.

pkg/apis/kubeflow/v1alpha2/types.go Outdated Show resolved Hide resolved
pkg/apis/kubeflow/v1alpha2/types.go Outdated Show resolved Hide resolved
@johnugeorge
Copy link
Member

Sounds good

@terrytangyuan
Copy link
Member Author

I've addressed all the comments. Please take a look again when you get a chance.

Copy link

@madhukarkm madhukarkm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @terrytangyuan. Few comments especially reg the ProcessingSpec.

// ReplicaStatus represents the current observed state of the replica.
type ReplicaStatus struct {
// The number of actively running pods.
Active int32 `json:"active,omitempty"`
Copy link

@madhukarkm madhukarkm Mar 12, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we add a Pending replica status -- besides Active, Succeeded, Failed?

May be useful for tf-operator also.. especially with coscheduling.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure that should probably be added once we have a common repo. We can focus on MPIJob's spec for this PR.

// RestartPolicy describes how the replicas should be restarted.
// Only one of the following restart policies may be specified.
// If none of the following policies is specified, the default one
// is RestartPolicyAlways.
Copy link

@madhukarkm madhukarkm Mar 12, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

L77 in ReplicaSpec says default for RestartPolicy is Never. Better to make these consistent.

Copy link
Member Author

@terrytangyuan terrytangyuan Mar 13, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's focus on MPIJob's spec for this PR to avoid any divergent changes in common types. I'll keep these notes unresolved here to remind us addressing them later.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK to handle these later.

pkg/apis/kubeflow/v1alpha2/types.go Outdated Show resolved Hide resolved
// Specifies the desired number of processing units the MPIJob should run on.
// Mutually exclusive with `ReplicaSpec.Replicas` in `MPIReplicaSpecs`.
// +optional
Units *int32 `json:"units,omitempty"`

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will these be converted to replica pod spec.. what about memory in the case of cpu type?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good question. This is not currently considered inside ProcessingSpec. We could add memory, disk, etc. to the spec. Another idea is that we could get rid of ProcessingSpec and make fully use of MPIReplicaSpecs which will allow users to provide specific pod spec. One feedback I often receive from our users is that having both ProcessingSpec and MPIReplicaSpecs are confusing. @rongou What do you think the best approach would be?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm fine with getting rid of ProcessingSpec.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good. @madhukarkm @rongou I've just removed ProcessingSpec. Please take another look when you get a chance. This is just initial API spec. I also added some TODOs that will be addressed later once we have a common operator.

pkg/apis/kubeflow/v1alpha2/types.go Outdated Show resolved Hide resolved
Copy link

@madhukarkm madhukarkm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good to remove ProcessingSpec.. looks like ResourceType can also be removed.

// RestartPolicy describes how the replicas should be restarted.
// Only one of the following restart policies may be specified.
// If none of the following policies is specified, the default one
// is RestartPolicyAlways.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK to handle these later.

pkg/apis/kubeflow/v1alpha2/types.go Outdated Show resolved Hide resolved
@terrytangyuan
Copy link
Member Author

Sounds good to remove ProcessingSpec.. looks like ResourceType can also be removed.

@madhukarkm Thanks. Fixed.

@madhukarkm
Copy link

/lgtm

Let's wait till tomorrow in case there are any other comments.

@k8s-ci-robot
Copy link

@madhukarkm: changing LGTM is restricted to assignees, and only kubeflow/mpi-operator repo collaborators may be assigned issues.

In response to this:

/lgtm

Let's wait till tomorrow in case there are any other comments.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@rongou
Copy link
Member

rongou commented Mar 16, 2019

/lgtm
/approve

@k8s-ci-robot
Copy link

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: rongou

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot merged commit b3c25ea into kubeflow:master Mar 16, 2019
@terrytangyuan terrytangyuan deleted the v1alpha2-spec branch March 16, 2019 22:32
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

8 participants