Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Create & label P1 issues needed for an initial release of Horovod support #778

Closed
jlewi opened this issue May 8, 2018 · 25 comments
Closed
Labels
area/horovod Issues related to Horvod area/openmpi Issues related to openmpi area/0.4.0 priority/p1
Milestone

Comments

@jlewi
Copy link
Contributor

jlewi commented May 8, 2018

We should create issues for everything that needs to happen to do an initial release of Horvod support. All such issues should be priority/p1.

We'd like to include Horvod in our 0.2 release. We need to figure out all the work that needs to happen to support that.

I created the label area/horvod

Anyone in the org should be able to label issues as follows

/area horvod
/priority p1

@jiezhang @everpeace Could you take a stab at coming up with a list of issues and labeling existing issues?

@k8s-ci-robot k8s-ci-robot added area/horovod Issues related to Horvod priority/p1 labels May 8, 2018
@jiezhang
Copy link

jiezhang commented May 8, 2018

Thanks @jlewi.

What's the timeline for 0.2 release? I think we should rename the area as openmpi. The package was written in a way that it's not coupled with horovod. For example, @everpeace is using the package to run distributed machine learning tasks which has nothing to do with horovod.

cc @alsrgv

@jiezhang
Copy link

jiezhang commented May 8, 2018

/area openmpi

@jiezhang
Copy link

jiezhang commented May 8, 2018

@jlewi Looks like the area command is not working for me.

@jlewi
Copy link
Contributor Author

jlewi commented May 8, 2018

The area/openmpi label doesn't exist. Can we use the area/horvod?

@jiezhang
Copy link

jiezhang commented May 8, 2018

@jlewi The package is not limited to horovod. It can run any MPI job. That's why I propose to rename the label as openmpi.

@everpeace
Copy link
Contributor

everpeace commented May 8, 2018

@jlewi Yeah, naming is a bit complicated. as jiezhang said, openmpi package is technically independent from Horovod. And, actually, I'm using this package with Chainer/ChainerMN.

Could you take a stab at coming up with a list of issues and labeling existing issues?

Sure. I'm happy to do it. Here are several new issues which I would like to put in 0.2. "new" means that they have not created yet. What do you think @jiezhang ??

  • support securityContext(especially RunAs,RunAsGroup)
    • In our cluster, we use NFS. when mounting NFS, we prohibit users to run pods with root from security reason.
  • support volume and volume mounts (relates to [openmpi] Upload trained models to persistent storage  #713 ?)
  • pave e2e test (at least on basic sample??)
  • (low priority) support initContainer
    • Imagine pulling user code via git-sync or so.

labeling existing issues?

@jlewi I opened three PR yesterday, How should I handle them? I think I would create stab isseues for them and put the label. Is that ok?? Or can I put labels to PR directly? cc/ @jiezhang

off topic here though, I'm now thinking I would make a proposal of chainer-operator by the way.

@jlewi
Copy link
Contributor Author

jlewi commented May 9, 2018

I created the label area/openmpi

@pineking
Copy link
Member

pineking commented May 9, 2018

support securityContext(especially RunAs)
In our cluster, we use NFS. when mounting NFS, we prohibit users to run pods with root from security reason.

@everpeace securityContext(RunAs) does not support group id, how can you specify gid in container when your programs read/write data using NFS

@everpeace
Copy link
Contributor

/area openmpi

@k8s-ci-robot k8s-ci-robot added the area/openmpi Issues related to openmpi label May 9, 2018
@everpeace
Copy link
Contributor

everpeace commented May 9, 2018

@pineking RunAsGroup is already on feature-gate since 1.10.

However, this RunAsGroup didn't work nicely with NFS VolumeSource. So we patched by ourselves (it is a small patch though). Of course, we plan to contribute it to the upstream.

To secure NFS data, you will need to stripSETUID capability in pod security policy too. Otherwise, users can build sudo-able conatainer.

// I updated my original comment.

@pineking
Copy link
Member

@everpeace Great, Thanks for your information. I will try the RunAsGroup feature, and look forward to your PR of the patch.

@jiezhang
Copy link

Thanks @jlewi . We'll create/tag issues needed for initial release.

@pdmack
Copy link
Member

pdmack commented May 10, 2018

's/horvod/horovod/g'

@jlewi jlewi changed the title Create & label P1 issues needed for an initial release of Horvod support Create & label P1 issues needed for an initial release of Horovod support May 11, 2018
@jlewi
Copy link
Contributor Author

jlewi commented May 21, 2018

@jiezhang @everpeace I see two issues labeled area/openmpi ? Is the list complete? Does it include items related to Horovod and Open MPI support?

@everpeace
Copy link
Contributor

everpeace commented May 21, 2018

@jlewi I would like to support one more feature of volumes/volumeMounts support (#838) if I could, sorry it was added just now, once kubeflow supports ks 0.10.2(#727). I set p2 to it because I'm not sure #727 was fixed before kubeflow 0.2 🙇 This is the last issue from me for kubeflow 0.2.

@jbottum
Copy link
Contributor

jbottum commented Sep 30, 2018

/area 0.4.0

@jlewi
Copy link
Contributor Author

jlewi commented Oct 8, 2018

@everpeace What is the status openmpi support? Can we close this issue and open up appropriate issues tagged 0.4.0?

@everpeace
Copy link
Contributor

everpeace commented Oct 9, 2018

yes, please 🙇 But as a personal opinion, encouraging users to migrate to MPIJob CRDs would be nicer.

@jlewi
Copy link
Contributor Author

jlewi commented Oct 14, 2018

@everpeace Can you explain that last comment? Are you saying that in 0.4 we should remove
https://github.com/kubeflow/kubeflow/tree/master/kubeflow/openmpi

And tell people to use the
mpi job crd
https://github.com/kubeflow/kubeflow/tree/master/kubeflow/mpi-job

@everpeace
Copy link
Contributor

everpeace commented Oct 15, 2018

@jlewi sorry for the confusion.

Are you saying that in 0.4 we should remove

Yes, I was. But, in this several days, I understood that there are some active users using openmpi package.

So, I think we don't need to remove this package at least in 0.4. However, I would like to inform users the fact that they have an option to use mpi crd instead of openmpi package. Because, openmpi package is based on bare pods so no fault tolerance (it can NOT retry when failure), but mpi-operator expands mpi crd to kind:Job and kind:StatefulSet so it has fault-tolerance (it CAN retry when failure).

@jlewi
Copy link
Contributor Author

jlewi commented Oct 16, 2018

SGTM.

Any suggestion about how we should start pushing users to use the CRD?

@everpeace
Copy link
Contributor

everpeace commented Oct 23, 2018

I would suggest putting some note on openmpi package README which recommends using MPIJob instead of openmpi package with an explicit description of MPIJob advantages (retry behavior etc.).

@chrisheecho
Copy link

Hello all, could someone give status update of this issue? would love to connect with the driver for this..is this you @everpeace and @jiezhang ?

@jlewi
Copy link
Contributor Author

jlewi commented Oct 25, 2018

I filed kubeflow/website#272 to update the docs.

#1859 to remove the existing package.

@everpeace
Copy link
Contributor

Thanks for creating issues 🙇

@carmine carmine added this to the 0.4.0 milestone Nov 6, 2018
surajkota pushed a commit to surajkota/kubeflow that referenced this issue Jun 13, 2022
kubeflow#778)

* Use kustomize to make it easier to maintain the versioned KFDef specs.

* For each KF release we need to define a KFDef spec that overrides certain
  values (e.g. the repo of kubeflow/manifests it uses)

* Previously we just did this by modifying the KFDef specs on the release
  branch. But this was very costly to maintain; i.e. backporting changes
  on master to the release branch

* To make that easier we can generate the KFDef YAML files using kustomize;
  this allows us to use overlays to define the changes needed to customize
  the specs for a particular version

* We can keep these versioned overlays on master so that the divergence between
  master and the branches is very low

* To preserve existing behavior we still check in YAML files. a simple
  script build_kfdef_specs.py is provided to generate them.

Related to: kubeflow#4685

* Set namespace.

* Update docs.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/horovod Issues related to Horvod area/openmpi Issues related to openmpi area/0.4.0 priority/p1
Projects
None yet
Development

No branches or pull requests

9 participants