
openmpi: slots clause should be generated when gpus '> 0' #692

Merged
merged 1 commit into kubeflow:master on Apr 20, 2018

Conversation

everpeace
Contributor

@everpeace everpeace commented Apr 20, 2018

@jiezhang It's a really tiny fix, could you review it? We should generate the slots= clause even when gpus==1.


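For illustration (an assumption based on the hostfile shown later in this thread, not the component's exact output), with gpus=1 the generated hostfile would then look something like:

openmpi-worker-0.train-mnist.kubeflow slots=1
openmpi-worker-1.train-mnist.kubeflow slots=1
openmpi-worker-2.train-mnist.kubeflow slots=1
openmpi-worker-3.train-mnist.kubeflow slots=1

whereas before this change the slots= clause was omitted for gpus==1, leaving Open MPI to pick the slot count itself.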

@jiezhang

I think it’s not needed when gpus == 1. I tested that and MPI was able to run properly.

According to the doc (https://www.open-mpi.org/faq/?category=running#mpirun-hostfile), the default value of slots is 1.

@everpeace
Contributor Author

everpeace commented Apr 20, 2018

@jiezhang oh, in my manual test, hosts without a slots= clause seem to get #cpus slots. I'm using Open MPI v2.1.2 with CUDA support. Which version of Open MPI did you test? Or did I miss something?

$ ks param list
COMPONENT   PARAM           VALUE
=========   =====           =====
train-mnist exec            "mpiexec --allow-run-as-root --display-map --hostfile /kubeflow/openmpi/assets/hostfile -n 4__masked__"
train-mnist gpus            1
train-mnist image           "__masked__"
train-mnist imagePullPolicy "IfNotPresent"
train-mnist init            "null"
train-mnist name            "train-mnist"
train-mnist namespace       "null"
train-mnist schedulerName   "default-scheduler"
train-mnist secret          "openmpi-secret"
train-mnist workers         4

$ k exec -it openmpi-master bash

root@openmpi-master:/# ompi_info --version
Open MPI v2.1.2
http://www.open-mpi.org/community/help/

root@openmpi-master:/# cat kubeflow/openmpi/assets/hostfile
openmpi-worker-0.train-mnist.kubeflow
openmpi-worker-1.train-mnist.kubeflow
openmpi-worker-2.train-mnist.kubeflow
openmpi-worker-3.train-mnist.kubeflow

/# mpiexec --allow-run-as-root -n 4 --display-map --hostfile /kubeflow/openmpi/assets/hostfile sh -c 'echo $(hostname):hello'
...
 Data for JOB [13036,1] offset 0

 ========================   JOB MAP   ========================

 Data for node: openmpi-worker-0.train-mnist.kubeflow Num slots: 16   Max slots: 0    Num procs: 4
        Process OMPI jobid: [13036,1] App: 0 Process rank: 0 Bound: N/A
        Process OMPI jobid: [13036,1] App: 0 Process rank: 1 Bound: N/A
        Process OMPI jobid: [13036,1] App: 0 Process rank: 2 Bound: N/A
        Process OMPI jobid: [13036,1] App: 0 Process rank: 3 Bound: N/A

 =============================================================
openmpi-worker-0:hello
openmpi-worker-0:hello
openmpi-worker-0:hello
openmpi-worker-0:hello

FYI, explicit --map-by node works as I expected.

root@openmpi-master:/# mpiexec --allow-run-as-root --map-by node -n 4 --display-map --hostfile /kubeflow/openmpi/assets/hostfile sh -c 'echo $(hostname):hello'
...
 Data for JOB [12810,1] offset 0

 ========================   JOB MAP   ========================

 Data for node: openmpi-worker-0.train-mnist.kubeflow Num slots: 16   Max slots: 0    Num procs: 1
        Process OMPI jobid: [12810,1] App: 0 Process rank: 0 Bound: N/A

 Data for node: openmpi-worker-1.train-mnist.kubeflow Num slots: 16   Max slots: 0    Num procs: 1
        Process OMPI jobid: [12810,1] App: 0 Process rank: 1 Bound: N/A

 Data for node: openmpi-worker-2.train-mnist.kubeflow Num slots: 16   Max slots: 0    Num procs: 1
        Process OMPI jobid: [12810,1] App: 0 Process rank: 2 Bound: N/A

 Data for node: openmpi-worker-3.train-mnist.kubeflow Num slots: 16   Max slots: 0    Num procs: 1
        Process OMPI jobid: [12810,1] App: 0 Process rank: 3 Bound: N/A

 =============================================================
openmpi-worker-0:hello
openmpi-worker-1:hello
openmpi-worker-2:hello
openmpi-worker-3:hello

@pdmack
Member

pdmack commented Apr 20, 2018

/ok-to-test

@pdmack
Member

pdmack commented Apr 20, 2018

/approve
/lgtm

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: pdmack

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot merged commit 0bdd1e2 into kubeflow:master Apr 20, 2018
@jiezhang

@everpeace I'm on Open MPI v3.0.0. If it's not working with v2.1.2, maybe we need to set slots even if gpus==0? It makes no sense to run all the workloads on worker-0.

@everpeace
Contributor Author

@jiezhang

If it's not working with v2.1.2, maybe we need to set slots even if gpus==0?

I don't think so. It would be needed only when gpus > 0, I think.
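To make the rule the thread converges on concrete, here is a minimal Python sketch of the intended behavior (the actual component is a ksonnet template; the function and parameter names below are hypothetical, and the hostname pattern is taken from the hostfile shown earlier):

def hostfile_lines(workers, name, namespace, gpus):
    # One worker pod hostname per line; append a slots= clause only when
    # GPUs are requested (gpus > 0), capping each worker at `gpus` MPI slots
    # so Open MPI does not fall back to the node's core count.
    lines = []
    for i in range(workers):
        host = "openmpi-worker-%d.%s.%s" % (i, name, namespace)
        if gpus > 0:
            host += " slots=%d" % gpus
        lines.append(host)
    return lines

# With workers=4 and gpus=1 this yields one "slots=1" entry per worker,
# so `mpiexec -n 4` places one rank on each worker instead of four on worker-0.
print("\n".join(hostfile_lines(4, "train-mnist", "kubeflow", 1)))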

saffaalvi pushed a commit to StatCan/kubeflow that referenced this pull request Feb 11, 2021