Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

openmpi: master/workers sync mechanism is replaced with k8s api from redis #696

Merged

Conversation

everpeace
Copy link
Contributor

@everpeace everpeace commented Apr 20, 2018

Currently, openmpi package uses redis to sync master/workers. But I think it is overkilled a bit. And we can watch master/workers status via k8s api.

In a pod, service account's token and kubernetes api server endpoint is embedded. So, we can see pods status just executing curl like this.

# I think --insecure is safe because this will be executed in a pod (inside of kubernetes cluster)
curl --insecure \
  --header "Authorization: Bearer $(cat /var/run/secrets/kubernetes.io/serviceaccount/token)" \
  https://$KUBERNETES_SERVICE_HOST:$KUBERNETES_SERVICE_PORT/api/v1/namespaces/$(cat /var/run/secrets/kubernetes.io/serviceaccount/namespace)/pods

So, I replaced master/workers sync mechanism with k8s api from redis.

What do you think?? I'm very happy if I could have some feedbacks.


This change is Reviewable

@everpeace
Copy link
Contributor Author

cc/ @jiezhang @pdmack would you mind reviewing this?

@everpeace everpeace force-pushed the openmpi-replace-redis-with-k8s-api branch from de9a84a to 62cb90d Compare April 20, 2018 17:16
@pdmack
Copy link
Member

pdmack commented Apr 20, 2018

/ok-to-test

@everpeace everpeace force-pushed the openmpi-replace-redis-with-k8s-api branch from 62cb90d to 41aa2d1 Compare April 20, 2018 17:36
@everpeace
Copy link
Contributor Author

everpeace commented Apr 20, 2018

@pdmack I'm so sorry. I force pushed my branch because I had typo... Could you /ok-to-test again 🙇 ??

@pdmack
Copy link
Member

pdmack commented Apr 20, 2018

It's fine. Test is running. Stay tuned.

@pdmack
Copy link
Member

pdmack commented Apr 20, 2018

/approve
/lgtm

Thanks @everpeace !

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: pdmack

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot merged commit c93a252 into kubeflow:master Apr 20, 2018
@jiezhang
Copy link

@everpeace @pdmack The worker reaches "running" state once init.sh starts running, but at that point the worker is not ready yet - it still has a bunch of initialization to finish.

@jiezhang
Copy link

But I do agree that introducing redis for synchronization wasn't an optimal solution.

@everpeace
Copy link
Contributor Author

everpeace commented Apr 20, 2018

@jiezhang

  • it still has a bunch of initialization to finish.

ah, you're right!! Then, the master can wait until ssh to all workers succeed. What do you think?

I created PR about this issue. Could you review this??

saffaalvi pushed a commit to StatCan/kubeflow that referenced this pull request Feb 11, 2021
…redis (kubeflow#696)

* openmpi: master/workers sync mechanism is replaced with k8s api from redis.

* openmpi: add everpeace to contributors.
surajkota pushed a commit to surajkota/kubeflow that referenced this pull request Jun 13, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants