Move process name {jobid,vpid} down to the OPAL layer. #261
Conversation
Force-pushed from 992ee45 to 55f0bd3
This change looks extremely invasive, introducing new and extremely specialized containers and forcing their use. Is there any benefit, performance or productivity, to adopting this approach? Personally, I think that the process_name should be an undefined entity, without a clear specification. An accessor to get a rank out of it might be necessary in the current code, but OPAL clearly should not know what a job id is.
I think you may be misunderstanding the change, so let me try to explain the reasoning behind it, and offer a suggestion.

At the very beginning of the Open MPI project, we had a fairly lengthy discussion over the form of the process_name_t. At that time, all parties (including UTK) settled on the solution of a struct containing two 32-bit fields, one for a "jobid" and the other for a "vpid". Our rationale at that time was based on the desire to support heterogeneous situations, and big/little endian systems.

This continued until we reached the point where two things occurred. First, you asked that I move the database framework to OPAL so that each layer could have its own db. When I did that, I created a non-structured "opal_identifier_t" comprised of a flat 64-bit field and mapped it on top of the structured process_name_t used by OMPI. This was a mistake on my part, as it created a number of problems for hetero operations and alignment violations on big endian systems. However, we don't test those environments, and so it went undetected for quite some time. In fairness, Siegmar did point out the issues, and we did try to bandaid them over time - yet maintaining the support proved troublesome, and we frankly misunderstood the root cause of many of his problem reports.

Second, the error was further compounded when we moved the BTLs down to OPAL. At that time, you adopted my opal_identifier_t and extended its use even further, creating an additional ecosystem around it. In addition, we moved several other code areas down to OPAL to support the BTLs, including the modex operations (now embodied in the opal/pmix framework). These areas depend strongly on the process_name_t structure. In fact, we currently have to artificially define the structured form of the name in each of those code areas, populate them, and then memcpy them across into the opal_identifier_t. This is necessary because every resource manager we know about/support passes identifier info to us as a jobid (in various formats) and rank, and so we have to load the identifier using that info. Even with those efforts, we continue to encounter problems with support for hetero operations and big-endian systems.

The simplest solution we can find is to revert back to the original definition of the OMPI process_name_t, acknowledging that our original thinking was correct and that we made a mistake. So rather than this change being something new, it actually is a reversion of a portion of a past commit that has proven to create a problem within the code base.

I think we all understand that you have a desire to use the OMPI code base in some external project where an abstract opal_identifier_t would be helpful. What I would suggest is that you adopt the strategy previously employed in such circumstances, including when I've worked on external projects: add the abstraction on your side of the code. In other words, instead of us continuing to fight the abstraction problem in the OMPI code base where it doesn't really help us, the abstraction should be dealt with in the external project. You would need to write your own wrapper functions to convert back/forth from the opal_process_name_t struct, but that isn't an overly burdensome requirement for re-use of the code base. Note that nothing in OMPI cares about the actual values inside the jobid/vpid fields of the process_name_t - you are free to assign any value you like to them.

We just need that identifier to be structured so we can properly ensure alignment everywhere it is used, and to ensure hetero operations are fully supported in a more easily maintainable way. HTH
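For anyone following along, the hetero point above can be made concrete with a small standalone C sketch. The demo_* names and swap helpers below are illustrative stand-ins, not the actual OPAL declarations or API; the sketch only shows why byte-swapping a flat 64-bit identifier for big/little-endian exchange scrambles the two logical fields, while a structured {jobid, vpid} name lets each 32-bit field be converted independently:

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Structured name: two 32-bit fields, as in the original OMPI design.
 * The demo_* names are illustrative, not the actual OPAL declarations. */
typedef struct {
    uint32_t jobid;
    uint32_t vpid;
} demo_process_name_t;

/* Flat identifier: a single 64-bit value, like opal_identifier_t. */
typedef uint64_t demo_identifier_t;

/* Byte-swap helpers (GCC/Clang builtins), used here only for the demo. */
static uint32_t swap32(uint32_t v) { return __builtin_bswap32(v); }
static uint64_t swap64(uint64_t v) { return __builtin_bswap64(v); }

int main(void)
{
    demo_process_name_t name = { .jobid = 42, .vpid = 7 };
    demo_identifier_t id;

    /* The old scheme: memcpy the struct on top of the flat 64-bit field. */
    memcpy(&id, &name, sizeof(id));

    /* Heterogeneous exchange: swapping the flat value as one 8-byte unit
     * reverses all eight bytes, so the jobid bytes land in the vpid half
     * (and vice versa), whereas swapping each 32-bit field independently
     * keeps the two fields where they belong. */
    uint64_t swapped_flat = swap64(id);
    demo_process_name_t swapped_name = { swap32(name.jobid), swap32(name.vpid) };

    printf("flat 64-bit swap : 0x%016" PRIx64 "\n", swapped_flat);
    printf("per-field swap   : jobid=0x%08x vpid=0x%08x\n",
           swapped_name.jobid, swapped_name.vpid);
    return 0;
}
```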
Test FAILed. Build Log
Is there any way to have these not fail when help messages unrelated to the PR cause issues with the detection script (particularly when the script seems to only be checking oshmem...)? Also, the "details" link next to "Build finished" goes to a 404 on bgate.mellanox.com.
yep, will fix it.
ok,
@ggouaillardet Looks like this PR has now become stale (can't be merged automatically). :-( Can you refresh?
I've been working on the PMIx code release some more, and on the next phase of memory footprint reductions. In doing that, I believe I'm converging on a solution that may be more palatable to George while still resolving the hetero and SPARC issues. However, it will still be "extremely invasive", as there is no way to resolve these things without causing disruption somewhere. George's approach is causing significant problems in the RTE-OPAL interface. The currently proposed solution removes those, but shifts the disruption to external re-users of OPAL such as George. The revised approach would also be disruptive, as it will cause significant changes to ORTE and at least some changes in OPAL. However, it would leave us with a non-structured "identifier", although perhaps only a uint32_t instead of a uint64_t to conserve memory. So the question becomes: which of these two paths do we want to take?
Neither of these fixes is going to backport to the 1.8 series, so it is purely a question of setting up for the 1.9 branch. Both paths would land in time for that event.
@jsquyres will do
@ggouaillardet - yep, it failed when I misconfigured jenkins 3 days ago; it was fixed later.
Seems the openib btl is not working with this commit; this command fails:
19:58:59 + timeout -s SIGKILL 3m mpirun -np 8 -bind-to core -mca btl_openib_if_include mlx4_0:1 -x MXM_RDMA_PORTS=mlx4_0:1 -mca pml ob1 -mca btl self,openib /var/lib/jenkins/jobs/gh-ompi-master-pr/workspace/ompi_install1/examples/hello_c
Okay - well that is pretty useless since most of us don't have IB. How did it fail? Can you fix it?
The same command line on the "master" branch works as expected.
We could certainly check /* if only you would tell us what the heck is wrong */
I'll do the rebase now and then test helloworld on an IB cluster.
Force-pushed from 55f0bd3 to 804e1d4
@miked-mellanox I rebased, did the required porting, and updated my branch (and hence this PR).
Test FAILed. Build Log
Force-pushed from 804e1d4 to 991d618
I just fixed the remaining error in pmix/s2, rebased, and updated my branch.
Test FAILed. Build Log
It now passes the MPI part (the previous failure also passes now), but fails on the OSHMEM API (oshrun helloworld).
@miked-mellanox I will have a look and make a "blind" port ...
Force-pushed from 991d618 to 58b3e41
"blind" port commited ... |
Force-pushed from 58b3e41 to 668d152
Test FAILed. Build Log
@miked-mellanox @elenash I just fixed a bug that occurs when vader is used without knem, then pushed the fix to master, rebased my branch, and "asked" jenkins to test again.
@ggouaillardet I just ran it on master. Should I check it on your fork? Is this issue in any way related to the issue in oshmem? As I understand it, this issue has been in master since last week.
@elenash is the root cause a very small /tmp filesystem?
@elenash the thing is that @miked-mellanox and I observe different behaviors ...
@ggouaillardet - jenkins is testing again on your command (see the last box in this thread with the jenkins details).
Test FAILed. Build Log
@alex-mikheev @elenash - could you please comment on the reason for the sshmem memheap failure?
@ggouaillardet Did you test it with mxm?
@elenash no (but I'll give it a try shortly)
@ggouaillardet master is working after your fix, thanks!
@ggouaillardet Strangely, I still reproduce the issue with the vader btl on your branch.
@elenash on my side, I get a hang with --mca btl tcp,self on master. I'll write a simpler MPI (i.e., non-oshmem) reproducer and investigate.
The reason is that on master the sshmem verbs memory allocator turns on 'shared_mr' on ConnectIB. I pushed the fix 097b469 into master.
@ggouaillardet - could you please rebase on top of Alex's fix and let jenkins do the rest :)
Force-pushed from 668d152 to 84d94a7
retest this please
Test FAILed. Build Log
Force-pushed from 84d94a7 to 668d152
* opal_process_name_t is now a struct:
      typedef uint32_t opal_jobid_t;
      typedef uint32_t opal_vpid_t;
      typedef struct {
          opal_jobid_t jobid;
          opal_vpid_t vpid;
      } opal_process_name_t;
* new opal_proc_table_t class: this is a hash table (key is jobid) of hash tables (key is vpid) and is used to store per-opal_process_name_t info
* new OPAL_NAME dss type

This commit is co-authored by Ralph and Gilles.
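To make the commit's second bullet concrete, here is a minimal, self-contained C sketch of the two-level lookup idea. The demo_* types and functions are illustrative stand-ins, not the actual opal_proc_table_t API (which, presumably, is built on OPAL's existing hash table class); small bucket arrays with chaining are used only to keep the sketch short:

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

typedef uint32_t opal_jobid_t;
typedef uint32_t opal_vpid_t;

typedef struct {
    opal_jobid_t jobid;
    opal_vpid_t  vpid;
} opal_process_name_t;

#define DEMO_BUCKETS 16

typedef struct demo_entry {              /* inner-table node, key = vpid  */
    opal_vpid_t        vpid;
    void              *value;
    struct demo_entry *next;
} demo_entry_t;

typedef struct demo_job {                /* outer-table node, key = jobid */
    opal_jobid_t     jobid;
    demo_entry_t    *vpids[DEMO_BUCKETS];
    struct demo_job *next;
} demo_job_t;

typedef struct {
    demo_job_t *jobs[DEMO_BUCKETS];      /* outer table, keyed by jobid   */
} demo_proc_table_t;

/* Store a value for {jobid, vpid}: find (or create) the per-job inner
 * table, then insert the vpid entry into it. */
static void demo_proc_table_set(demo_proc_table_t *t,
                                opal_process_name_t name, void *value)
{
    demo_job_t **jslot = &t->jobs[name.jobid % DEMO_BUCKETS];
    demo_job_t *j;
    for (j = *jslot; NULL != j && j->jobid != name.jobid; j = j->next) {
        ;   /* walk the chain looking for this jobid */
    }
    if (NULL == j) {
        j = calloc(1, sizeof(*j));
        j->jobid = name.jobid;
        j->next = *jslot;
        *jslot = j;
    }
    demo_entry_t *e = calloc(1, sizeof(*e));
    e->vpid = name.vpid;
    e->value = value;
    e->next = j->vpids[name.vpid % DEMO_BUCKETS];
    j->vpids[name.vpid % DEMO_BUCKETS] = e;
}

/* Look up per-process data: hash on jobid first, then on vpid. */
static void *demo_proc_table_get(demo_proc_table_t *t, opal_process_name_t name)
{
    for (demo_job_t *j = t->jobs[name.jobid % DEMO_BUCKETS]; j; j = j->next) {
        if (j->jobid != name.jobid) continue;
        for (demo_entry_t *e = j->vpids[name.vpid % DEMO_BUCKETS]; e; e = e->next) {
            if (e->vpid == name.vpid) return e->value;
        }
    }
    return NULL;   /* nothing stored for this process name */
}

int main(void)
{
    demo_proc_table_t table = { { NULL } };
    opal_process_name_t peer = { .jobid = 1234, .vpid = 3 };
    demo_proc_table_set(&table, peer, "per-process data for rank 3");
    printf("stored: %s\n", (char *)demo_proc_table_get(&table, peer));
    return 0;
}
```

One property of the two-level layout, in the sketch as presumably in the real class, is that everything stored for a given jobid hangs off a single outer entry, so per-job cleanup is straightforward.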
Force-pushed from 668d152 to ddb72d1
retest this please
Test FAILed. Build Log
Test FAILed. Build Log
retest this please
Test PASSed.
Committed by Ralph per telecon discussion on 11/11/2014. |