-
Notifications
You must be signed in to change notification settings - Fork 219
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[RFC] Add initial version of the roadmap #159
Conversation
Signed-off-by: terrytangyuan <terrytangyuan@gmail.com>
Signed-off-by: terrytangyuan <terrytangyuan@gmail.com>
|
||
## CI/CD | ||
|
||
* Automate the process to publish images to Docker Hub whenever there's new release/commit. Related issue: [#93](https://github.com/kubeflow/mpi-operator/issues/93). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We have a solution to this in KFServing using Cloud Build and Git Tags. Would love to discuss if you can find a better solution, or want to use ours.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks! I was thinking of just using Travis but that may require storing access tokens that might be visible to other kubeflow org members. It would be great if we could reuse any existing solutions that already work well for KFServing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. It will help us make MPI operator more mature. Thanks!
ROADMAP.md
Outdated
* Decouple the tight dependency on Open MPI and support other collective communication frameworks. | ||
Related issue: [#12](https://github.com/kubeflow/mpi-operator/issues/12). | ||
* Support new versions of MPI Operator in [kubeflow/manifests](https://github.com/kubeflow/manifests). | ||
* Resign different components of MPI Operator to support fault tolerant collective communication frameworks such as [caicloud/ftlib](https://github.com/caicloud/ftlib). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/cc @carmark
If you are interested in it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@gaocegege Thanks. Shall we split this target into small pieces, such as MPI-FT with node/card replacement, and then MPI-FT with dynamic nodes/cards scale?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we already implement both features in ftlib
/cc @zw0610
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Our fault-tolerant design can handle both situations. However, so far, we do not consider any MPI implementations. The collective communication is handled by other libraries such like NCCL and Gloo.
@gaocegege: GitHub didn't allow me to request PR reviews from the following users: zw0610. Note that only kubeflow members and repo collaborators can review this PR, and authors cannot review their own PRs. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
@richardsliu Should we lower the test coverage threshold for this repo, too? |
ROADMAP.md
Outdated
* Decouple the tight dependency on Open MPI and support other collective communication frameworks. | ||
Related issue: [#12](https://github.com/kubeflow/mpi-operator/issues/12). | ||
* Support new versions of MPI Operator in [kubeflow/manifests](https://github.com/kubeflow/manifests). | ||
* Resign different components of MPI Operator to support fault tolerant collective communication frameworks such as [caicloud/ftlib](https://github.com/caicloud/ftlib). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You mean redesign
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks! Fixed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/lgtm
New changes are detected. LGTM label has been removed. |
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
Two LGTMs already. I am manually merging this now as the test coverage check won't pass anyways. Feel free to add further comments/suggestions/feedback after the merge too. |
Hi community,
This is the initial version of the roadmap for MPI Operator and we would appreciate any feedback on this. Feel free to suggest any new items for future releases.
Signed-off-by: terrytangyuan terrytangyuan@gmail.com
This change is