Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[RFC] Add initial version of the roadmap #159

Merged
merged 3 commits into from Dec 13, 2019
Merged

[RFC] Add initial version of the roadmap #159

merged 3 commits into from Dec 13, 2019

Conversation

terrytangyuan
Copy link
Member

@terrytangyuan terrytangyuan commented Dec 10, 2019

Hi community,

This is the initial version of the roadmap for MPI Operator and we would appreciate any feedback on this. Feel free to suggest any new items for future releases.

Signed-off-by: terrytangyuan terrytangyuan@gmail.com


This change is Reviewable

Signed-off-by: terrytangyuan <terrytangyuan@gmail.com>
Signed-off-by: terrytangyuan <terrytangyuan@gmail.com>

## CI/CD

* Automate the process to publish images to Docker Hub whenever there's new release/commit. Related issue: [#93](https://github.com/kubeflow/mpi-operator/issues/93).

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have a solution to this in KFServing using Cloud Build and Git Tags. Would love to discuss if you can find a better solution, or want to use ours.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! I was thinking of just using Travis but that may require storing access tokens that might be visible to other kubeflow org members. It would be great if we could reuse any existing solutions that already work well for KFServing.

Copy link
Member

@gaocegege gaocegege left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. It will help us make MPI operator more mature. Thanks!

ROADMAP.md Outdated
* Decouple the tight dependency on Open MPI and support other collective communication frameworks.
Related issue: [#12](https://github.com/kubeflow/mpi-operator/issues/12).
* Support new versions of MPI Operator in [kubeflow/manifests](https://github.com/kubeflow/manifests).
* Resign different components of MPI Operator to support fault tolerant collective communication frameworks such as [caicloud/ftlib](https://github.com/caicloud/ftlib).
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/cc @carmark

If you are interested in it.

Copy link
Member

@carmark carmark Dec 11, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@gaocegege Thanks. Shall we split this target into small pieces, such as MPI-FT with node/card replacement, and then MPI-FT with dynamic nodes/cards scale?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we already implement both features in ftlib

/cc @zw0610

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Our fault-tolerant design can handle both situations. However, so far, we do not consider any MPI implementations. The collective communication is handled by other libraries such like NCCL and Gloo.

@kubeflow kubeflow deleted a comment from antshuwen Dec 11, 2019
@k8s-ci-robot
Copy link

@gaocegege: GitHub didn't allow me to request PR reviews from the following users: zw0610.

Note that only kubeflow members and repo collaborators can review this PR, and authors cannot review their own PRs.

In response to this:

I think we already implement both features in ftlib

/cc @zw0610

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@gaocegege
Copy link
Member

@richardsliu Should we lower the test coverage threshold for this repo, too?

ROADMAP.md Outdated
* Decouple the tight dependency on Open MPI and support other collective communication frameworks.
Related issue: [#12](https://github.com/kubeflow/mpi-operator/issues/12).
* Support new versions of MPI Operator in [kubeflow/manifests](https://github.com/kubeflow/manifests).
* Resign different components of MPI Operator to support fault tolerant collective communication frameworks such as [caicloud/ftlib](https://github.com/caicloud/ftlib).
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You mean redesign?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! Fixed.

Copy link
Member

@rongou rongou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

@k8s-ci-robot
Copy link

New changes are detected. LGTM label has been removed.

@k8s-ci-robot k8s-ci-robot removed the lgtm label Dec 13, 2019
@k8s-ci-robot
Copy link

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
To complete the pull request process, please assign rongou
You can assign the PR to them by writing /assign @rongou in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@terrytangyuan
Copy link
Member Author

terrytangyuan commented Dec 13, 2019

Two LGTMs already. I am manually merging this now as the test coverage check won't pass anyways. Feel free to add further comments/suggestions/feedback after the merge too.

@terrytangyuan terrytangyuan merged commit 48944f4 into kubeflow:master Dec 13, 2019
@terrytangyuan terrytangyuan deleted the roadmap branch December 13, 2019 21:35
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

7 participants