Ensure we use the first compute node's topology for mapping #480
Conversation
Refer to this link for build results (access rights to CI server needed):
I applied this patch to v1.8 and retried. It fails:
$ salloc -p pivy -N 2 --ntasks-per-node=12 /labhome/miked/workspace/git/mellanox-hpc/ompi-release/debug-v1.8/install/bin/mpirun -mca hwloc_base_verbose 100 -mca ess_base_verbose 100 -mca plm_base_verbose 10 --cpu-set 12,13,14,15,17,18,19,20,21,22,23 --bind-to core --tag-output --timestamp-output --display-map --map-by node -mca pml yalla -x MXM_TLS=ud,self,shm -x MXM_RDMA_PORTS=mlx5_1:1 /hpc/mtr_scrap/users/mtt/scratch/mxm/20150316_203008_62575_82530_r-hp01/installs/HBRa/tests/mpi-test-suite/ompi-tests/mpi_test_suite/mpi_test_suite -x relaxed -t 'All,^One-sided' -n 300
salloc: Granted job allocation 82571
salloc: Waiting for resource configuration
salloc: Nodes r-hp[01-02] are ready for job
...
[hpchead:01257] hwloc:base:get_topology
[hpchead:01257] hwloc:base: filtering cpuset
[hpchead:01257] Searching for 12 LOGICAL PU
[hpchead:01257] Searching for 13 LOGICAL PU
[hpchead:01257] Searching for 14 LOGICAL PU
[hpchead:01257] Searching for 15 LOGICAL PU
[hpchead:01257] Searching for 17 LOGICAL PU
--------------------------------------------------------------------------
A specified logical processor does not exist in this topology:
CPU number: 17
Cpu set given: 12,13,14,15,17,18,19,20,21,22,23
--------------------------------------------------------------------------
--------------------------------------------------------------------------
It looks like orte_init failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems. This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):
topology discovery failed
--> Returned value (null) (0) instead of ORTE_SUCCESS
--------------------------------------------------------------------------
salloc: Relinquishing job allocation 82571
This patch is against master, not 1.8 - we may require a custom patch for 1.8. Could you please try master?
Will do; I had some trouble with PMIx.
Same failure on master.
Weird - please rerun with -mca plm_base_verbose 5 so we can see what happened.
Here it goes: ftp://bgate.mellanox.com/upload/pr480_out.txt
AHHHH - I see the problem. mpirun is attempting to filter the topology using the given cpuset, and since those CPUs don't exist there, it never gets to the point of looking at the compute node topology. This will take a bit more pondering to solve.
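For context, the filtering step described here can be pictured with a short hwloc sketch. This is an editorial illustration, not the actual ORTE code; the cpu list is the one from the failing run above, and the point is that looking up a logical PU that the local (head node) topology does not have is exactly the "does not exist in this topology" failure.

```c
/*
 * Minimal sketch, not the actual ORTE code: treat each entry of a
 * --cpu-set style list as a LOGICAL PU index and look it up in the
 * topology of the node we are running on.  On a head node that has no
 * PU 17, the lookup returns NULL, which corresponds to the
 * "specified logical processor does not exist" error shown above.
 */
#include <hwloc.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    /* cpu list taken from the failing command line above */
    const unsigned cpus[] = { 12, 13, 14, 15, 17, 18, 19, 20, 21, 22, 23 };
    const unsigned ncpus = sizeof(cpus) / sizeof(cpus[0]);

    hwloc_topology_t topo;
    hwloc_topology_init(&topo);
    hwloc_topology_load(topo);              /* local (head node) topology */

    hwloc_bitmap_t allowed = hwloc_bitmap_alloc();
    for (unsigned i = 0; i < ncpus; i++) {
        hwloc_obj_t pu = hwloc_get_obj_by_type(topo, HWLOC_OBJ_PU, cpus[i]);
        if (NULL == pu) {
            /* this is the failure mode hit on hpchead for PU 17 */
            fprintf(stderr, "logical PU %u does not exist in this topology\n",
                    cpus[i]);
            continue;
        }
        hwloc_bitmap_or(allowed, allowed, pu->cpuset);
    }

    char *str;
    hwloc_bitmap_asprintf(&str, allowed);
    printf("cpuset built from the list: %s\n", str);
    free(str);

    hwloc_bitmap_free(allowed);
    hwloc_topology_destroy(topo);
    return 0;
}
```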
…that no other compute nodes are involved. This deals with the corner case where mpirun is executing on a node with a different topology from the compute nodes.
@miked-mellanox Please try this now. I believe I fixed it.
bot:retest
Refer to this link for build results (access rights to CI server needed):
… on every node. We can then run everything through the filter as before, which ensures that any procs run on mpirun are also contained within the specified cpuset.
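As a hedged illustration of "running everything through the filter", the sketch below uses hwloc_topology_restrict to confine a topology to an allowed cpuset. It is an assumption about the general approach, not the actual ORTE filter, and for brevity it builds the cpuset directly from the OS index range 12-23, whereas the verbose output above shows ORTE resolving LOGICAL PU indices first.

```c
/*
 * Hedged sketch of the filtering approach: restrict the topology to an
 * allowed cpuset so that anything mapped or bound on this node afterwards
 * is confined to it.  Illustration only, not the actual ORTE filter.
 */
#include <hwloc.h>
#include <stdio.h>

/* hypothetical helper: prune everything outside 'allowed' from 'topo' */
static int restrict_to_cpuset(hwloc_topology_t topo, hwloc_const_bitmap_t allowed)
{
    return hwloc_topology_restrict(topo, allowed, 0);
}

int main(void)
{
    hwloc_topology_t topo;
    hwloc_topology_init(&topo);
    hwloc_topology_load(topo);

    /* e.g. a --cpu-set of 12-23; the real code would build this bitmap
     * from LOGICAL PU lookups as in the previous sketch */
    hwloc_bitmap_t allowed = hwloc_bitmap_alloc();
    hwloc_bitmap_set_range(allowed, 12, 23);

    if (restrict_to_cpuset(topo, allowed) != 0)
        fprintf(stderr, "restrict failed (cpuset does not intersect topology?)\n");

    printf("PUs remaining after the filter: %d\n",
           hwloc_get_nbobjs_by_type(topo, HWLOC_OBJ_PU));

    hwloc_bitmap_free(allowed);
    hwloc_topology_destroy(topo);
    return 0;
}
```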
bot:retest
Refer to this link for build results (access rights to CI server needed):
@rhc54 - thanks for the fixes. It works now, but:
@miked-mellanox I had to make an additional correction - see if this now works correctly for you.
Refer to this link for build results (access rights to CI server needed): Build Log. Test FAILed.
@miked-mellanox Looks like your Jenkins tests are broken - this test has nothing to do with my changes.
But it fails only with this patchset and passes without it:
$ export XXX_C=3 XXX_D=4 XXX_E=5
/var/lib/jenkins/jobs/gh-ompi-master-pr/workspace/ompi_install1/bin/mpirun -np 2 -mca mca_base_env_list 'XXX_A=1;XXX_B=2;XXX_C;XXX_D;XXX_E' env
++ grep '^XXX_'
++ wc -l
+ val=0
+ '[' 0 -ne 10 ']'
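For readers unfamiliar with this Jenkins test: mca_base_env_list takes a semicolon-separated list in which NAME=value entries carry their own value and bare names are forwarded from the caller's environment, so five variables across two ranks should produce the ten matching lines the script checks for. Below is a minimal, illustrative interpretation of such a list, not the actual OPAL code.

```c
/*
 * Illustration only: one common way to interpret an env list such as
 * "XXX_A=1;XXX_B=2;XXX_C;XXX_D;XXX_E".  Entries containing '=' carry
 * their own value; bare names take their value from the caller's
 * environment.  A launcher would forward the result to every rank.
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

static void apply_env_list(const char *list)
{
    char *copy = strdup(list);
    for (char *tok = strtok(copy, ";"); tok != NULL; tok = strtok(NULL, ";")) {
        if (strchr(tok, '=') != NULL) {
            /* "NAME=value": value given explicitly (putenv keeps the string) */
            putenv(strdup(tok));
        } else {
            /* bare "NAME": copy the value from our own environment */
            const char *val = getenv(tok);
            if (val != NULL)
                setenv(tok, val, 1);
            else
                fprintf(stderr, "variable %s not found in environment\n", tok);
        }
    }
    free(copy);
}

int main(void)
{
    apply_env_list("XXX_A=1;XXX_B=2;XXX_C;XXX_D;XXX_E");
    /* with 5 variables forwarded to each of 2 ranks, the test's
     * `env | grep '^XXX_' | wc -l` total would be 10 */
    printf("XXX_A=%s\n", getenv("XXX_A"));
    return 0;
}
```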
Weird - it makes no sense. Maybe I'm not fully updated on something on this branch? I don't touch the env vars anywhere in this patch, nor did I touch the MCA system in any way.
Hmmm...I see the issue. It has nothing to do with that command. I'll work on it.
Refer to this link for build results (access rights to CI server needed):
With the new patch, I got this:
$ salloc -p pivy -N 2 --ntasks-per-node=12 /labhome/miked/workspace/git/mellanox-hpc/ompi-release/debug-master-480/install/bin/mpirun --cpu-set 12,13,14,15,16,17,18,19,20,21,22,23 --bind-to core --tag-output --timestamp-output --display-map --map-by node -mca pml yalla -x MXM_TLS=ud,self,shm -x MXM_RDMA_PORTS=mlx5_1:1 ~/workspace/git/mellanox-hpc/ompi-release/examples/hello_c
salloc: Granted job allocation 82798
salloc: Waiting for resource configuration
salloc: Nodes r-hp[01-02] are ready for job
[hpchead:23712] [[29402,0],0] ORTE_ERROR_LOG: Bad parameter in file ../../../../orte/mca/rmaps/base/rmaps_base_binding.c at line 745
[hpchead:23712] [[29402,0],0] ORTE_ERROR_LOG: Bad parameter in file ../../../../orte/mca/rmaps/base/rmaps_base_map_job.c at line 353
salloc: Relinquishing job allocation 82798
salloc: Job allocation 82798 has been revoked.
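The Bad parameter error is raised from the binding path (rmaps_base_binding.c). As a rough sketch of the kind of check a binder has to make when --bind-to core is combined with a restricted cpuset, here is a hypothetical helper; it is not the actual rmaps code.

```c
/*
 * Rough sketch, not the actual rmaps code: a core is only usable for
 * --bind-to core if its cpuset intersects the allowed (restricted) set.
 * The helper and its name are hypothetical.
 */
#include <hwloc.h>
#include <stdio.h>

static int count_bindable_cores(hwloc_topology_t topo, hwloc_const_bitmap_t allowed)
{
    int usable = 0;
    int ncores = hwloc_get_nbobjs_by_type(topo, HWLOC_OBJ_CORE);
    for (int i = 0; i < ncores; i++) {
        hwloc_obj_t core = hwloc_get_obj_by_type(topo, HWLOC_OBJ_CORE, i);
        if (hwloc_bitmap_intersects(core->cpuset, allowed))
            usable++;       /* at least one of this core's PUs is allowed */
    }
    return usable;
}

int main(void)
{
    hwloc_topology_t topo;
    hwloc_topology_init(&topo);
    hwloc_topology_load(topo);

    /* in the scenario above, 'allowed' would come from --cpu-set;
     * here we simply use the machine's allowed cpuset for illustration */
    hwloc_bitmap_t allowed = hwloc_bitmap_alloc();
    hwloc_bitmap_copy(allowed, hwloc_topology_get_allowed_cpuset(topo));

    printf("bindable cores: %d\n", count_bindable_cores(topo, allowed));

    hwloc_bitmap_free(allowed);
    hwloc_topology_destroy(topo);
    return 0;
}
```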
@miked-mellanox Okay, try it again.
Refer to this link for build results (access rights to CI server needed):
Woo-hoo! Works fine! Thanks a lot.
Could you please squash it into a single commit?
…cleanup plm/alps: remove unneeded env. variable setting
@miked-mellanox This should fix the cpu-set issue you mentioned - please check and verify. If it does, then we'll bring it to 1.8.