
[RFC] Add initial version of the roadmap #159

Merged: 3 commits, Dec 13, 2019
Changes from 2 commits
24 changes: 24 additions & 0 deletions ROADMAP.md
@@ -0,0 +1,24 @@
# Roadmap of MPI Operator

This document provides a high-level overview of where MPI Operator will grow in future releases. See discussions in the original RFC [here](https://github.com/kubeflow/mpi-operator/pull/159).

## New Features / Enhancements

* Decouple the tight dependency on Open MPI and support other collective communication frameworks (see the hypothetical manifest sketch after this list).
Related issue: [#12](https://github.com/kubeflow/mpi-operator/issues/12).
* Support new versions of MPI Operator in [kubeflow/manifests](https://github.com/kubeflow/manifests).
* Resign different components of MPI Operator to support fault tolerant collective communication frameworks such as [caicloud/ftlib](https://github.com/caicloud/ftlib).
Member:
/cc @carmark

If you are interested in it.

@carmark (Member), Dec 11, 2019:
@gaocegege Thanks. Shall we split this target into smaller pieces, such as MPI-FT with node/card replacement, and then MPI-FT with dynamic node/card scaling?

Member:

I think we have already implemented both features in ftlib.

/cc @zw0610

Member:

Our fault-tolerant design can handle both situations. However, so far we do not consider any MPI implementations; collective communication is handled by other libraries such as NCCL and Gloo.

Member:

You mean redesign?

Member (Author):

Thanks! Fixed.

* Allow more flexible RBAC when creating `MPIJob`s so that existing RBAC resources can be reused (see the RBAC sketch after this list). Related issue: [#20](https://github.com/kubeflow/mpi-operator/issues/20).
* Support installation of MPI Operator via [Helm](https://github.com/helm/helm). Related issue: [#11](https://github.com/kubeflow/mpi-operator/issues/11).
* Support [Go modules](https://blog.golang.org/migrating-to-go-modules).
* Consider supporting the launch of framework-specific services such as [TensorBoard](https://www.tensorflow.org/tensorboard) and [Horovod Timeline](https://github.com/horovod/horovod#horovod-timeline). Since [tf-operator](https://github.com/kubeflow/tf-operator) already supports TensorBoard, we may want to move this functionality to [kubeflow/common](https://github.com/kubeflow/common) so it can be reused. Related issue: [#138](https://github.com/kubeflow/mpi-operator/issues/138).
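To make the decoupling item above more concrete, here is a hypothetical `MPIJob` manifest sketching how a pluggable collective communication backend could be selected. The `mpiImplementation` field, its values, and the `mpi-pi` image are illustrative assumptions, not part of the current API:

```yaml
# Hypothetical sketch for issue #12: the `mpiImplementation` field does not
# exist yet; it shows one way the API could expose pluggable backends.
apiVersion: kubeflow.org/v1alpha2
kind: MPIJob
metadata:
  name: pi-compute
spec:
  mpiImplementation: OpenMPI   # assumed values: OpenMPI, MPICH, a non-MPI backend, ...
  mpiReplicaSpecs:
    Launcher:
      replicas: 1
      template:
        spec:
          containers:
            - name: launcher
              image: mpioperator/mpi-pi:latest   # illustrative image
    Worker:
      replicas: 2
      template:
        spec:
          containers:
            - name: worker
              image: mpioperator/mpi-pi:latest   # illustrative image
```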
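Similarly, for the RBAC item, a minimal sketch of pre-created resources that every `MPIJob` launcher could share, assuming the launcher mainly needs to watch worker pods and exec into them (the exact rules here are an assumption, not the operator's current behavior):

```yaml
# Sketch for issue #20: create RBAC once and reuse it across MPIJobs instead
# of having the operator generate per-job ServiceAccounts/Roles/RoleBindings.
apiVersion: v1
kind: ServiceAccount
metadata:
  name: mpi-launcher
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: mpi-launcher
rules:
  - apiGroups: [""]
    resources: ["pods"]
    verbs: ["get", "list", "watch"]
  - apiGroups: [""]
    resources: ["pods/exec"]
    verbs: ["create"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: mpi-launcher
subjects:
  - kind: ServiceAccount
    name: mpi-launcher
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: mpi-launcher
```

A launcher pod template could then opt into the shared account via `serviceAccountName: mpi-launcher`.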

## CI/CD

* Automate the process of publishing images to Docker Hub whenever there is a new release or commit (see the sketch at the end of this section). Related issue: [#93](https://github.com/kubeflow/mpi-operator/issues/93).

Comment:

We have a solution to this in KFServing using Cloud Build and Git Tags. Would love to discuss if you can find a better solution, or want to use ours.

Member (Author):

Thanks! I was thinking of just using Travis but that may require storing access tokens that might be visible to other kubeflow org members. It would be great if we could reuse any existing solutions that already work well for KFServing.

* Ensure new versions of `deploy/mpi-operator.yaml` are always compatible with [kubeflow/manifests](https://github.com/kubeflow/manifests).
* Add end-to-end tests via Kubeflow's testing infrastructure. Related issue: [#9](https://github.com/kubeflow/mpi-operator/issues/9).
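As a starting point for the image-publishing item, here is a hedged sketch of the Cloud Build + Git tag approach mentioned in the discussion above. The image name is an assumption, and pushing to Docker Hub would additionally require authenticating Cloud Build against the registry:

```yaml
# Hypothetical cloudbuild.yaml for issue #93, assuming a trigger that fires on
# Git tags so $TAG_NAME carries the release version. Not an existing config.
steps:
  - name: gcr.io/cloud-builders/docker
    args: ["build", "-t", "docker.io/mpioperator/mpi-operator:$TAG_NAME", "."]
  - name: gcr.io/cloud-builders/docker
    args: ["push", "docker.io/mpioperator/mpi-operator:$TAG_NAME"]
```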

## Bug Fixes

* Improve the statuses reported for launcher and worker pods. Related issue: [#90](https://github.com/kubeflow/mpi-operator/issues/90).