[openmpi] Introduce a sidecar container for inter-pod synchronization #704

Merged
merged 1 commit into from Apr 24, 2018


@jiezhang jiezhang commented Apr 21, 2018

  • openmpi-controller monitors the master pod's status and creates a semaphore file "term.sig" to signal openmpi-job to terminate
  • openmpi-job is now decoupled from Kubernetes
  • openmpi-controller and openmpi-job share a volume for inter-container communication
  • openmpi-controller can be extended in the future to support data snapshots
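
The file-based handshake described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the function names (`watch_master`, `wait_for_termination`, `kubernetes_phase_getter`), the polling interval, and the choice of "Succeeded"/"Failed" as the terminal phases are assumptions. Only the file name `term.sig` and the shared-volume idea come from the description.

```python
import os
import tempfile
import time
from typing import Callable, Optional

TERM_SIGNAL_FILE = "term.sig"  # file name taken from the PR description


def watch_master(get_phase: Callable[[], str], shared_dir: str,
                 poll_interval: float = 1.0) -> str:
    """Controller side (hypothetical): poll the master pod's phase and
    create term.sig in the shared volume once the phase is terminal."""
    while True:
        phase = get_phase()
        if phase in ("Succeeded", "Failed"):  # assumed terminal phases
            with open(os.path.join(shared_dir, TERM_SIGNAL_FILE), "w") as f:
                f.write(phase)
            return phase
        time.sleep(poll_interval)


def wait_for_termination(shared_dir: str, poll_interval: float = 1.0,
                         timeout: Optional[float] = None) -> bool:
    """Job side (hypothetical): wait until term.sig appears. Only the
    shared volume is touched, so no Kubernetes access is needed here."""
    deadline = None if timeout is None else time.monotonic() + timeout
    path = os.path.join(shared_dir, TERM_SIGNAL_FILE)
    while not os.path.exists(path):
        if deadline is not None and time.monotonic() >= deadline:
            return False
        time.sleep(poll_interval)
    return True


def kubernetes_phase_getter(pod_name: str, namespace: str) -> Callable[[], str]:
    """Assumed wiring to the Python kubernetes client; the import is lazy
    so the rest of the sketch runs without a cluster."""
    from kubernetes import client, config
    config.load_incluster_config()
    api = client.CoreV1Api()
    return lambda: api.read_namespaced_pod(
        name=pod_name, namespace=namespace).status.phase


if __name__ == "__main__":
    # Local dry run with a fake phase sequence instead of a real cluster.
    phases = iter(["Pending", "Running", "Succeeded"])
    with tempfile.TemporaryDirectory() as shared:
        print(watch_master(lambda: next(phases), shared, poll_interval=0.01))
        print(wait_for_termination(shared, poll_interval=0.01, timeout=1))
```

Passing the phase lookup in as a callable is one way to keep the script runnable locally, since the cluster call can be swapped for a stub.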


@jiezhang
Author

/assign @jlewi

@jiezhang
Author

/retest

@everpeace
Contributor

@jiezhang Thank you for the solution 👍 And my apologies for having introduced insufficient synchronization... Please don't get me wrong; no offence was meant 🙇. I just wanted to propose how to modify this package by opening PRs.

openmpi-controller now depends on version 6 of the Python kubernetes client. Does this work with older versions of Kubernetes? I'm not sure about Kubeflow's policy, but if it should, how could we test it?

@jiezhang
Author

@everpeace Thanks for your contribution. The package is still in its early stages. Changes are welcome.

I think it’s hard to maintain init.sh in the long term without making changes to the docker image. I’m introducing a separate container where we have complete control to make future changes easier. We should be able to run the script locally to validate its functionality. I’m planning to add more features to it, e.g. backing up the logs and trained model to persistent storage. It’s much easier to implement more advanced features using python.

And it should be compatible with older versions of k8s according to the documentation: https://github.com/kubernetes-client/python/blob/master/README.md

@everpeace
Contributor

> I think it’s hard to maintain init.sh in the long term without making changes to the docker image

I agree.

> I’m planning to add more features to it, e.g. backing up the logs and trained model to persistent storage. It’s much easier to implement more advanced features using python.

That's nice. I would be very happy to contribute to it. Is there any room for that?

> And it should be compatible with older versions of k8s according to the documentation:

I misunderstood the meaning of the '+' mark in the doc. You're right, it should work.

I'm not one of the reviewers, but this PR looks good to me 😀

@jiezhang
Author

/retest

@pdmack
Member

pdmack commented Apr 24, 2018

@jiezhang @everpeace thumbs up? We good?

@jiezhang jiezhang force-pushed the controller branch 3 times, most recently from 3d9a59a to a61a13a Compare April 24, 2018 01:11
@pdmack
Member

pdmack commented Apr 24, 2018

/approve
/lgtm

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: pdmack

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot merged commit 1976d6b into kubeflow:master Apr 24, 2018
@jiezhang
Author

@everpeace I opened #713 to track the future work in the controller. Feel free to provide your feedback there.

saffaalvi pushed a commit to StatCan/kubeflow that referenced this pull request Feb 11, 2021
…kubeflow#704)

surajkota pushed a commit to surajkota/kubeflow that referenced this pull request Jun 13, 2022