
Conversation

@rhc54
Contributor

@rhc54 rhc54 commented Mar 17, 2015

@miked-mellanox This should fix the cpu-set issue you mentioned - please check and verify. If it does, then we'll bring it to 1.8

@mellanox-github

Refer to this link for build results (access rights to CI server needed):
http://bgate.mellanox.com/jenkins/job/gh-ompi-master-pr/356/
Test PASSed.

@mike-dubman
Member

I applied this patch to v1.8 and retried. It fails:

$salloc -p pivy -N 2 --ntasks-per-node=12 /labhome/miked/workspace/git/mellanox-hpc/ompi-release/debug-v1.8/install/bin/mpirun -mca hwloc_base_verbose 100  -mca ess_base_verbose 100 -mca plm_base_verbose 10 --cpu-set 12,13,14,15,17,18,19,20,21,22,23   --bind-to core --tag-output --timestamp-output --display-map  --map-by node -mca pml yalla   -x MXM_TLS=ud,self,shm -x MXM_RDMA_PORTS=mlx5_1:1 /hpc/mtr_scrap/users/mtt/scratch/mxm/20150316_203008_62575_82530_r-hp01/installs/HBRa/tests/mpi-test-suite/ompi-tests/mpi_test_suite/mpi_test_suite -x relaxed -t 'All,^One-sided' -n 300
salloc: Granted job allocation 82571
salloc: Waiting for resource configuration
salloc: Nodes r-hp[01-02] are ready for job

...


[hpchead:01257] hwloc:base:get_topology
[hpchead:01257] hwloc:base: filtering cpuset
[hpchead:01257] Searching for 12 LOGICAL PU
[hpchead:01257] Searching for 13 LOGICAL PU
[hpchead:01257] Searching for 14 LOGICAL PU
[hpchead:01257] Searching for 15 LOGICAL PU
[hpchead:01257] Searching for 17 LOGICAL PU
--------------------------------------------------------------------------
A specified logical processor does not exist in this topology:

  CPU number:     17
  Cpu set given:  12,13,14,15,17,18,19,20,21,22,23
--------------------------------------------------------------------------
--------------------------------------------------------------------------
It looks like orte_init failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems.  This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

  topology discovery failed
  --> Returned value (null) (0) instead of ORTE_SUCCESS
--------------------------------------------------------------------------
salloc: Relinquishing job allocation 82571
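
For reference, the "Searching for N LOGICAL PU" lines above correspond to a logical-index lookup against the locally loaded hwloc topology, roughly like the following sketch (illustrative only, not the actual Open MPI code). If the node running mpirun has fewer PUs than the compute nodes, an index such as 17 simply is not there:

/* Minimal hwloc sketch of a logical-PU lookup of the kind the verbose
 * output above describes; illustrative only, not the ORTE source. */
#include <stdio.h>
#include <hwloc.h>

int main(void)
{
    hwloc_topology_t topo;
    hwloc_obj_t pu;
    unsigned idx = 17;   /* logical PU index taken from the --cpu-set list */

    hwloc_topology_init(&topo);
    hwloc_topology_load(topo);

    pu = hwloc_get_obj_by_type(topo, HWLOC_OBJ_PU, idx);
    if (NULL == pu) {
        /* the situation reported above as "does not exist in this topology" */
        fprintf(stderr, "logical PU %u not present (only %d PUs on this node)\n",
                idx, hwloc_get_nbobjs_by_type(topo, HWLOC_OBJ_PU));
    } else {
        printf("logical PU %u found (os_index %u)\n", idx, pu->os_index);
    }

    hwloc_topology_destroy(topo);
    return 0;
}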

@rhc54
Contributor Author

rhc54 commented Mar 17, 2015

This patch is against master, not 1.8 - we may require a custom patch for 1.8. Could you please try master?

@mike-dubman
Member

Will do - I had some trouble with pmix.

@mike-dubman
Member

Same failure on master:

[hpchead:23128] Searching for 14 LOGICAL PU
[hpchead:23128] Searching for 15 LOGICAL PU
[hpchead:23128] Searching for 17 LOGICAL PU
--------------------------------------------------------------------------
A specified logical processor does not exist in this topology:

  CPU number:     17
  Cpu set given:  11,12,13,14,15,17,18,19,20,21,22,23
--------------------------------------------------------------------------
--------------------------------------------------------------------------
It looks like orte_init failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems.  This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

  topology discovery failed
  --> Returned value (null) (0) instead of ORTE_SUCCESS
--------------------------------------------------------------------------

@rhc54
Contributor Author

rhc54 commented Mar 17, 2015

Weird - please rerun with -mca plm_base_verbose 5 so we can see what happened

@mike-dubman
Member

Here it is: ftp://bgate.mellanox.com/upload/pr480_out.txt

@rhc54
Contributor Author

rhc54 commented Mar 17, 2015

AHHHH - I see the problem. mpirun is attempting to filter the topology using the given cpuset, and since those cpus don't exist there, it never gets to the point of looking at the compute node topology. This will take a bit more pondering to solve.
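
A minimal sketch of the filtering step being described, using the hwloc API directly (illustrative only; the actual Open MPI code differs): the --cpu-set list is translated into a cpuset and the locally discovered topology is restricted to it, so on mpirun's own node a logical PU that only exists on the compute nodes makes the filter fail before any remote topology is examined.

/* Hedged sketch of filtering a local topology by a --cpu-set list.
 * Illustrative only; not the Open MPI implementation. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <hwloc.h>

int main(void)
{
    /* hypothetical cpu-set list, as passed via --cpu-set */
    const char *cpu_set = "12,13,14,15,17,18,19,20,21,22,23";
    hwloc_topology_t topo;
    hwloc_bitmap_t allowed = hwloc_bitmap_alloc();
    char *list = strdup(cpu_set), *tok, *save;

    hwloc_topology_init(&topo);
    hwloc_topology_load(topo);

    for (tok = strtok_r(list, ",", &save); tok; tok = strtok_r(NULL, ",", &save)) {
        unsigned idx = (unsigned) strtoul(tok, NULL, 10);
        hwloc_obj_t pu = hwloc_get_obj_by_type(topo, HWLOC_OBJ_PU, idx);
        if (NULL == pu) {
            /* analogous to the orte_init abort seen earlier in the thread */
            fprintf(stderr, "logical PU %u missing on this node - filter fails here\n", idx);
            return 1;
        }
        hwloc_bitmap_or(allowed, allowed, pu->cpuset);
    }

    /* on a node that does contain all listed PUs, the topology can be
     * narrowed to the requested cpuset */
    hwloc_topology_restrict(topo, allowed, 0);
    printf("topology restricted to %d PUs\n",
           hwloc_get_nbobjs_by_type(topo, HWLOC_OBJ_PU));

    free(list);
    hwloc_bitmap_free(allowed);
    hwloc_topology_destroy(topo);
    return 0;
}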

Ralph Castain added 2 commits March 17, 2015 10:46
…that no other compute nodes are involved. This deals with the corner case where mpirun is executing on a node with a different topology from the compute nodes.
@rhc54
Contributor Author

rhc54 commented Mar 17, 2015

@miked-mellanox Please try this now. I believe I fixed it

@rhc54
Contributor Author

rhc54 commented Mar 17, 2015

bot:retest

@mellanox-github

Refer to this link for build results (access rights to CI server needed):
http://bgate.mellanox.com/jenkins/job/gh-ompi-master-pr/362/
Test PASSed.

… on every node. We can then run everything thru the filter as before, which ensures that any procs run on mpirun are also contained within the specified cpuset.
@rhc54
Contributor Author

rhc54 commented Mar 17, 2015

bot:retest

@mellanox-github

Refer to this link for build results (access rights to CI server needed):
http://bgate.mellanox.com/jenkins/job/gh-ompi-master-pr/363/
Test PASSed.

@mike-dubman
Member

@rhc54 - thanks for the fixes. It now works, but:

  • some errors were printed at the end of the run (see below)
  • is the "UNBOUND" token expected?
  • the full log with details is here: ftp://bgate.mellanox.com/upload/480_out.txt
$salloc -p pivy -N 2 --ntasks-per-node=12 /labhome/miked/workspace/git/mellanox-hpc/ompi-release/debug-master-480/install/bin/mpirun  --cpu-set 12,13,14,15,16,17,18,19,20,21,22,23   --bind-to core --tag-output --timestamp-output --display-map  --map-by node -mca pml yalla   -x MXM_TLS=ud,self,shm -x MXM_RDMA_PORTS=mlx5_1:1 ~/workspace/git/mellanox-hpc/ompi-release/examples/hello_c

========================   JOB MAP   ========================

 Data for node: r-hp01  Num slots: 12   Max slots: 0    Num procs: 12
        Process OMPI jobid: [6958,1] App: 0 Process rank: 0 Bound: UNBOUND
        Process OMPI jobid: [6958,1] App: 0 Process rank: 2 Bound: UNBOUND
        Process OMPI jobid: [6958,1] App: 0 Process rank: 4 Bound: UNBOUND
        Process OMPI jobid: [6958,1] App: 0 Process rank: 6 Bound: UNBOUND
        Process OMPI jobid: [6958,1] App: 0 Process rank: 8 Bound: UNBOUND
        Process OMPI jobid: [6958,1] App: 0 Process rank: 10 Bound: UNBOUND
        Process OMPI jobid: [6958,1] App: 0 Process rank: 12 Bound: UNBOUND
        Process OMPI jobid: [6958,1] App: 0 Process rank: 14 Bound: UNBOUND
        Process OMPI jobid: [6958,1] App: 0 Process rank: 16 Bound: UNBOUND
        Process OMPI jobid: [6958,1] App: 0 Process rank: 18 Bound: UNBOUND
        Process OMPI jobid: [6958,1] App: 0 Process rank: 20 Bound: UNBOUND
        Process OMPI jobid: [6958,1] App: 0 Process rank: 22 Bound: UNBOUND

 Data for node: r-hp02  Num slots: 12   Max slots: 0    Num procs: 12
        Process OMPI jobid: [6958,1] App: 0 Process rank: 1 Bound: UNBOUND
        Process OMPI jobid: [6958,1] App: 0 Process rank: 3 Bound: UNBOUND
        Process OMPI jobid: [6958,1] App: 0 Process rank: 5 Bound: UNBOUND
        Process OMPI jobid: [6958,1] App: 0 Process rank: 7 Bound: UNBOUND
        Process OMPI jobid: [6958,1] App: 0 Process rank: 9 Bound: UNBOUND
        Process OMPI jobid: [6958,1] App: 0 Process rank: 11 Bound: UNBOUND
        Process OMPI jobid: [6958,1] App: 0 Process rank: 13 Bound: UNBOUND
        Process OMPI jobid: [6958,1] App: 0 Process rank: 15 Bound: UNBOUND
        Process OMPI jobid: [6958,1] App: 0 Process rank: 17 Bound: UNBOUND
        Process OMPI jobid: [6958,1] App: 0 Process rank: 19 Bound: UNBOUND
        Process OMPI jobid: [6958,1] App: 0 Process rank: 21 Bound: UNBOUND
        Process OMPI jobid: [6958,1] App: 0 Process rank: 23 Bound: UNBOUND

 =============================================================
Wed Mar 18 09:23:19 2015[1,16]<stdout>:Hello, world, I am 16 of 24, (Open MPI v1.9a1, package: Open MPI miked@hpchead Distribution, ident: 1.9.0a1, repo rev: dev-1346-gb41d2ad6, Unreleased developer copy, 135)
Wed Mar 18 09:23:19 2015[1,22]<stdout>:Hello, world, I am 22 of 24, (Open MPI v1.9a1, package: Open MPI miked@hpchead Distribution, ident: 1.9.0a1, repo rev: dev-1346-gb41d2ad6, Unreleased developer copy, 135)
Wed Mar 18 09:23:19 2015[1,20]<stdout>:Hello, world, I am 20 of 24, (Open MPI v1.9a1, package: Open MPI miked@hpchead Distribution, ident: 1.9.0a1, repo rev: dev-1346-gb41d2ad6, Unreleased developer copy, 135)
Wed Mar 18 09:23:19 2015[1,17]<stdout>:Hello, world, I am 17 of 24, (Open MPI v1.9a1, package: Open MPI miked@hpchead Distribution, ident: 1.9.0a1, repo rev: dev-1346-gb41d2ad6, Unreleased developer copy, 135)
Wed Mar 18 09:23:19 2015[1,19]<stdout>:Hello, world, I am 19 of 24, (Open MPI v1.9a1, package: Open MPI miked@hpchead Distribution, ident: 1.9.0a1, repo rev: dev-1346-gb41d2ad6, Unreleased developer copy, 135)
Wed Mar 18 09:23:19 2015[1,2]<stdout>:Hello, world, I am 2 of 24, (Open MPI v1.9a1, package: Open MPI miked@hpchead Distribution, ident: 1.9.0a1, repo rev: dev-1346-gb41d2ad6, Unreleased developer copy, 135)
Wed Mar 18 09:23:19 2015[1,4]<stdout>:Hello, world, I am 4 of 24, (Open MPI v1.9a1, package: Open MPI miked@hpchead Distribution, ident: 1.9.0a1, repo rev: dev-1346-gb41d2ad6, Unreleased developer copy, 135)
Wed Mar 18 09:23:19 2015[1,12]<stdout>:Hello, world, I am 12 of 24, (Open MPI v1.9a1, package: Open MPI miked@hpchead Distribution, ident: 1.9.0a1, repo rev: dev-1346-gb41d2ad6, Unreleased developer copy, 135)
Wed Mar 18 09:23:19 2015[1,0]<stdout>:Hello, world, I am 0 of 24, (Open MPI v1.9a1, package: Open MPI miked@hpchead Distribution, ident: 1.9.0a1, repo rev: dev-1346-gb41d2ad6, Unreleased developer copy, 135)
Wed Mar 18 09:23:19 2015[1,14]<stdout>:Hello, world, I am 14 of 24, (Open MPI v1.9a1, package: Open MPI miked@hpchead Distribution, ident: 1.9.0a1, repo rev: dev-1346-gb41d2ad6, Unreleased developer copy, 135)
Wed Mar 18 09:23:19 2015[1,3]<stdout>:Hello, world, I am 3 of 24, (Open MPI v1.9a1, package: Open MPI miked@hpchead Distribution, ident: 1.9.0a1, repo rev: dev-1346-gb41d2ad6, Unreleased developer copy, 135)
Wed Mar 18 09:23:19 2015[1,10]<stdout>:Hello, world, I am 10 of 24, (Open MPI v1.9a1, package: Open MPI miked@hpchead Distribution, ident: 1.9.0a1, repo rev: dev-1346-gb41d2ad6, Unreleased developer copy, 135)
Wed Mar 18 09:23:19 2015[1,23]<stdout>:Hello, world, I am 23 of 24, (Open MPI v1.9a1, package: Open MPI miked@hpchead Distribution, ident: 1.9.0a1, repo rev: dev-1346-gb41d2ad6, Unreleased developer copy, 135)
Wed Mar 18 09:23:19 2015[1,1]<stdout>:Hello, world, I am 1 of 24, (Open MPI v1.9a1, package: Open MPI miked@hpchead Distribution, ident: 1.9.0a1, repo rev: dev-1346-gb41d2ad6, Unreleased developer copy, 135)
Wed Mar 18 09:23:19 2015[1,7]<stdout>:Hello, world, I am 7 of 24, (Open MPI v1.9a1, package: Open MPI miked@hpchead Distribution, ident: 1.9.0a1, repo rev: dev-1346-gb41d2ad6, Unreleased developer copy, 135)
Wed Mar 18 09:23:19 2015[1,8]<stdout>:Hello, world, I am 8 of 24, (Open MPI v1.9a1, package: Open MPI miked@hpchead Distribution, ident: 1.9.0a1, repo rev: dev-1346-gb41d2ad6, Unreleased developer copy, 135)
Wed Mar 18 09:23:19 2015[1,5]<stdout>:Hello, world, I am 5 of 24, (Open MPI v1.9a1, package: Open MPI miked@hpchead Distribution, ident: 1.9.0a1, repo rev: dev-1346-gb41d2ad6, Unreleased developer copy, 135)
Wed Mar 18 09:23:19 2015[1,9]<stdout>:Hello, world, I am 9 of 24, (Open MPI v1.9a1, package: Open MPI miked@hpchead Distribution, ident: 1.9.0a1, repo rev: dev-1346-gb41d2ad6, Unreleased developer copy, 135)
Wed Mar 18 09:23:19 2015[1,6]<stdout>:Hello, world, I am 6 of 24, (Open MPI v1.9a1, package: Open MPI miked@hpchead Distribution, ident: 1.9.0a1, repo rev: dev-1346-gb41d2ad6, Unreleased developer copy, 135)
Wed Mar 18 09:23:19 2015[1,11]<stdout>:Hello, world, I am 11 of 24, (Open MPI v1.9a1, package: Open MPI miked@hpchead Distribution, ident: 1.9.0a1, repo rev: dev-1346-gb41d2ad6, Unreleased developer copy, 135)
Wed Mar 18 09:23:19 2015[1,13]<stdout>:Hello, world, I am 13 of 24, (Open MPI v1.9a1, package: Open MPI miked@hpchead Distribution, ident: 1.9.0a1, repo rev: dev-1346-gb41d2ad6, Unreleased developer copy, 135)
Wed Mar 18 09:23:19 2015[1,15]<stdout>:Hello, world, I am 15 of 24, (Open MPI v1.9a1, package: Open MPI miked@hpchead Distribution, ident: 1.9.0a1, repo rev: dev-1346-gb41d2ad6, Unreleased developer copy, 135)
Wed Mar 18 09:23:19 2015[1,21]<stdout>:Hello, world, I am 21 of 24, (Open MPI v1.9a1, package: Open MPI miked@hpchead Distribution, ident: 1.9.0a1, repo rev: dev-1346-gb41d2ad6, Unreleased developer copy, 135)
Wed Mar 18 09:23:19 2015[1,18]<stdout>:Hello, world, I am 18 of 24, (Open MPI v1.9a1, package: Open MPI miked@hpchead Distribution, ident: 1.9.0a1, repo rev: dev-1346-gb41d2ad6, Unreleased developer copy, 135)
[hpchead:13652] 85 more processes have sent help message help-opal-hwloc-base.txt / cpu-not-found
[hpchead:13652] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
salloc: Relinquishing job allocation 82676
salloc: Job allocation 82676 has been revoked.
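
One way to double-check the UNBOUND report from inside the application is a small hwloc query such as the sketch below (a hypothetical debugging aid, not part of this PR or of the hello_c example); a process bound to the full allowed cpuset is effectively unbound.

/* Hypothetical helper that could be called after MPI_Init() to report
 * whether the current process is bound to a subset of the machine. */
#include <stdio.h>
#include <stdlib.h>
#include <hwloc.h>

static void report_binding(void)
{
    hwloc_topology_t topo;
    hwloc_bitmap_t bound = hwloc_bitmap_alloc();
    hwloc_bitmap_t all;
    char *str = NULL;

    hwloc_topology_init(&topo);
    hwloc_topology_load(topo);
    all = hwloc_bitmap_dup(hwloc_topology_get_allowed_cpuset(topo));

    if (0 == hwloc_get_cpubind(topo, bound, HWLOC_CPUBIND_PROCESS)) {
        hwloc_bitmap_asprintf(&str, bound);
        /* bound to everything == effectively UNBOUND */
        printf("cpubind: %s (%s)\n", str,
               hwloc_bitmap_isequal(bound, all) ? "UNBOUND" : "bound");
        free(str);
    }

    hwloc_bitmap_free(all);
    hwloc_bitmap_free(bound);
    hwloc_topology_destroy(topo);
}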

@rhc54
Contributor Author

rhc54 commented Mar 18, 2015

@miked-mellanox I had to make an additional correction - see if this now works correctly for you

@mellanox-github

Refer to this link for build results (access rights to CI server needed):
http://bgate.mellanox.com/jenkins/job/gh-ompi-master-pr/367/

Build Log
last 50 lines

[...truncated 33633 lines...]
+ '[' -n '' ']'
+ btl_openib=yes
+ btl_tcp=yes
+ btl_sm=yes
+ btl_vader=yes
++ echo /var/lib/jenkins/jobs/gh-ompi-master-pr/workspace/ompi_install1
+ for OMPI_HOME in '$(echo $ompi_home_list)'
+ echo 'check if mca_base_env_list parameter is supported in /var/lib/jenkins/jobs/gh-ompi-master-pr/workspace/ompi_install1'
check if mca_base_env_list parameter is supported in /var/lib/jenkins/jobs/gh-ompi-master-pr/workspace/ompi_install1
++ /var/lib/jenkins/jobs/gh-ompi-master-pr/workspace/ompi_install1/bin/ompi_info --param mca base --level 9
++ grep mca_base_env_list
++ wc -l
+ val=2
+ '[' 2 -gt 0 ']'
+ echo 'test mca_base_env_list option in /var/lib/jenkins/jobs/gh-ompi-master-pr/workspace/ompi_install1'
test mca_base_env_list option in /var/lib/jenkins/jobs/gh-ompi-master-pr/workspace/ompi_install1
+ export XXX_C=3 XXX_D=4 XXX_E=5
+ XXX_C=3
+ XXX_D=4
+ XXX_E=5
++ /var/lib/jenkins/jobs/gh-ompi-master-pr/workspace/ompi_install1/bin/mpirun -np 2 -mca mca_base_env_list 'XXX_A=1;XXX_B=2;XXX_C;XXX_D;XXX_E' env
++ grep '^XXX_'
++ wc -l
+ val=0
+ '[' 0 -ne 10 ']'
+ exit 1
Build step 'Execute shell' marked build as failure
TAP Reports Processing: START
Looking for TAP results report in workspace using pattern: **/*.tap
Saving reports...
Processing '/var/lib/jenkins/jobs/gh-ompi-master-pr/builds/367/tap-master-files/cov_stat.tap'
Parsing TAP test result [/var/lib/jenkins/jobs/gh-ompi-master-pr/builds/367/tap-master-files/cov_stat.tap].
not ok - coverity detected 917 failures in all_367 # SKIP http://bgate.mellanox.com:8888/jenkins/job/gh-ompi-master-pr//ws/cov_build/all_367/output/errors/index.html
not ok - coverity detected 5 failures in oshmem_367 # TODO http://bgate.mellanox.com:8888/jenkins/job/gh-ompi-master-pr//ws/cov_build/oshmem_367/output/errors/index.html
ok - coverity found no issues for yalla_367
ok - coverity found no issues for mxm_367
not ok - coverity detected 2 failures in fca_367 # TODO http://bgate.mellanox.com:8888/jenkins/job/gh-ompi-master-pr//ws/cov_build/fca_367/output/errors/index.html
ok - coverity found no issues for hcoll_367

TAP Reports Processing: FINISH
coverity_for_all    http://bgate.mellanox.com:8888/jenkins/job/gh-ompi-master-pr//ws/cov_build/all_367/output/errors/index.html
coverity_for_oshmem http://bgate.mellanox.com:8888/jenkins/job/gh-ompi-master-pr//ws/cov_build/oshmem_367/output/errors/index.html
coverity_for_fca    http://bgate.mellanox.com:8888/jenkins/job/gh-ompi-master-pr//ws/cov_build/fca_367/output/errors/index.html
[copy-to-slave] The build is taking place on the master node, no copy back to the master will take place.
Setting commit status on GitHub for https://github.com/open-mpi/ompi/commit/5bb2d86c58436f389691cb1d441a6d849e6ea6ea
[BFA] Scanning build for known causes...

[BFA] Done. 0s
Setting status of ac04a78beb365731150c51d70b31c64c53b72bd8 to FAILURE with url http://bgate.mellanox.com:8888/jenkins/job/gh-ompi-master-pr/367/ and message: Merged build finished.

Test FAILed.

@rhc54
Contributor Author

rhc54 commented Mar 18, 2015

@miked-mellanox Looks like your Jenkins tests are broken - this test has nothing to do with my changes.

@mike-dubman
Member

But it fails only with this patchset and passes without it. The check expects 10 matching lines (5 XXX_ variables across 2 ranks) but got 0:

$ export XXX_C=3 XXX_D=4 XXX_E=5
$ /var/lib/jenkins/jobs/gh-ompi-master-pr/workspace/ompi_install1/bin/mpirun -np 2 -mca mca_base_env_list 'XXX_A=1;XXX_B=2;XXX_C;XXX_D;XXX_E' env
++ grep '^XXX_'
++ wc -l
+ val=0
+ '[' 0 -ne 10 ']'

@rhc54
Contributor Author

rhc54 commented Mar 18, 2015

Weird - it makes no sense. Maybe I'm not fully updated on something on this branch? I don't touch the envars anywhere in this patch, nor do I touch the MCA system in any way.

@rhc54
Contributor Author

rhc54 commented Mar 18, 2015

Hmmm... I see the issue. It has nothing to do with that command. I'll work on it.

@mellanox-github

Refer to this link for build results (access rights to CI server needed):
http://bgate.mellanox.com/jenkins/job/gh-ompi-master-pr/368/
Test PASSed.

@mike-dubman
Member

With the new patch, I got this:

$salloc -p pivy -N 2 --ntasks-per-node=12 /labhome/miked/workspace/git/mellanox-hpc/ompi-release/debug-master-480/install/bin/mpirun  --cpu-set 12,13,14,15,16,17,18,19,20,21,22,23   --bind-to core --tag-output --timestamp-output --display-map  --map-by node -mca pml yalla   -x MXM_TLS=ud,self,shm -x MXM_RDMA_PORTS=mlx5_1:1 ~/workspace/git/mellanox-hpc/ompi-release/examples/hello_c
salloc: Granted job allocation 82798
salloc: Waiting for resource configuration
salloc: Nodes r-hp[01-02] are ready for job
[hpchead:23712] [[29402,0],0] ORTE_ERROR_LOG: Bad parameter in file ../../../../orte/mca/rmaps/base/rmaps_base_binding.c at line 745
[hpchead:23712] [[29402,0],0] ORTE_ERROR_LOG: Bad parameter in file ../../../../orte/mca/rmaps/base/rmaps_base_map_job.c at line 353
salloc: Relinquishing job allocation 82798
salloc: Job allocation 82798 has been revoked.

@rhc54
Contributor Author

rhc54 commented Mar 19, 2015

@miked-mellanox okay, try it again

@mellanox-github

Refer to this link for build results (access rights to CI server needed):
http://bgate.mellanox.com/jenkins/job/gh-ompi-master-pr/369/
Test PASSed.

@mike-dubman
Member

woo-hoo! works fine! thanks a lot.

@mike-dubman
Member

Could you please squash it into a single commit?

@rhc54 rhc54 closed this Mar 19, 2015
@rhc54 rhc54 deleted the topic/topo branch March 19, 2015 23:32
jsquyres added a commit to jsquyres/ompi that referenced this pull request Nov 10, 2015
…leanup

plm/alps: remove unneeded env. variable setting