Move process name {jobid,vpid} down to the OPAL layer. #261

Closed

Conversation

ggouaillardet
Contributor

  • opal_process_name_t is now a struct :
    typedef uint32_t opal_jobid_t;
    typedef uint32_t opal_vpid_t;
    typedef struct {
        opal_jobid_t jobid;
        opal_vpid_t vpid;
    } opal_process_name_t;
  • new opal_proc_table_t class :
    this is a hash table (key is jobid) of hash tables (key is vpid)
    and is used to store per-opal_process_name_t info (see the sketch after this list).
  • new OPAL_NAME dss type
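
A minimal, self-contained sketch of the two-level lookup idea behind opal_proc_table_t (illustrative only: plain arrays and a linear search stand in for the real opal_hash_table_t-based implementation, and the helper name is made up):

    #include <stddef.h>
    #include <stdint.h>
    #include <stdio.h>

    typedef uint32_t opal_jobid_t;
    typedef uint32_t opal_vpid_t;
    typedef struct {
        opal_jobid_t jobid;
        opal_vpid_t vpid;
    } opal_process_name_t;

    /* inner "table": per-vpid entries for one job */
    typedef struct { opal_vpid_t vpid; const char *info; } vpid_entry_t;
    /* outer "table": per-jobid collection of inner tables */
    typedef struct { opal_jobid_t jobid; const vpid_entry_t *entries; size_t n; } job_entry_t;

    /* two-level lookup: jobid selects the inner table, vpid selects the entry */
    static const char *proc_table_get(const job_entry_t *jobs, size_t njobs,
                                      opal_process_name_t name)
    {
        for (size_t j = 0; j < njobs; j++) {
            if (jobs[j].jobid != name.jobid) continue;
            for (size_t v = 0; v < jobs[j].n; v++) {
                if (jobs[j].entries[v].vpid == name.vpid)
                    return jobs[j].entries[v].info;
            }
        }
        return NULL;
    }

    int main(void)
    {
        const vpid_entry_t job123[] = { { 0, "endpoint info for rank 0" },
                                        { 1, "endpoint info for rank 1" } };
        const job_entry_t jobs[] = { { 123, job123, 2 } };
        const opal_process_name_t who = { .jobid = 123, .vpid = 1 };
        printf("%s\n", proc_table_get(jobs, 1, who));
        return 0;
    }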

@ggouaillardet ggouaillardet force-pushed the topic/opal_process_name branch 2 times, most recently from 992ee45 to 55f0bd3 on November 5, 2014 08:32
@bosilca
Member

bosilca commented Nov 5, 2014

This change looks extremely invasive, introducing new and extremely specialized containers. Is there any benefit, performance or productivity, to adopting this approach?

Personally, I think that the process_name should be an undefined entity, without clear specification. An accessor to get a rank out of it might be necessary in the current code, but clearly OPAL should not know what a job id is.

@rhc54
Contributor

rhc54 commented Nov 6, 2014

I think you may be misunderstanding the change, so let me try to explain the reasoning behind it, and offer a suggestion.

At the very beginning of the Open MPI project, we had a fairly lengthy discussion over the form of the process_name_t. At that time, all parties (including UTK) settled on the solution of a struct containing two 32-bit fields, one for a "jobid" and the other for a "vpid". Our rationale at that time was based on the desire to support heterogeneous situations, including mixed big/little endian systems.

This continued until we reached the point where two things occurred. First, you asked that I move the database framework to OPAL so that each layer could have its own db. When I did that, I created a non-structured "opal_identifier_t" comprised of a flat 64-bit field and mapped it on top of the structured process_name_t used by OMPI. This was a mistake on my part as it created a number of problems for hetero operations, as well as alignment violations on big-endian systems. However, we don't test those environments and so it went undetected for quite some time. In fairness, Siegmar did point out the issues, and we did try to bandaid them over time - yet maintaining the support proved troublesome, and we frankly misunderstood the root cause of many of his problem reports.

The error was further compounded when we moved the BTLs down to OPAL. At that time, you adopted my opal_identifier_t and extended its use even further, creating an additional ecosystem around it. In addition, we moved several other code areas down to OPAL to support the BTLs, including the modex operations (now embodied in the opal/pmix framework). These areas depend strongly on the process_name_t structure. In fact, we currently have to artificially define the structured form of the name in each of those code areas, populate them, and then memcpy them across into the opal_identifier_t. This is necessary because every resource manager we know about/support passes identifier info to us as a jobid (in various formats) and rank, and so we have to load the identifier using that info.
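
(For illustration only: this is not the actual OMPI code and the names are made up. On a strict-alignment target such as SPARC, mapping a flat 64-bit identifier on top of the 4-byte-aligned structured name runs into exactly this kind of hazard:)

    #include <inttypes.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    /* the structured name only requires 4-byte alignment */
    typedef struct {
        uint32_t jobid;
        uint32_t vpid;
    } name_t;

    int main(void)
    {
        name_t name = { .jobid = 42, .vpid = 7 };
        uint64_t flat;

        /* UNSAFE if &name is only 4-byte aligned (SIGBUS on SPARC):
         *     flat = *(uint64_t *)&name;
         * safe everywhere, which is why the flat identifier had to be
         * copied around with memcpy: */
        memcpy(&flat, &name, sizeof(flat));

        /* and byte-swapping the flat value as a single 64-bit quantity on a
         * heterogeneous (mixed-endian) path scrambles the two 32-bit fields,
         * whereas the structured form can swap each field individually */
        printf("flat = 0x%016" PRIx64 "\n", flat);
        return 0;
    }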

Even with those efforts, we continue to encounter problems with support for hetero operations and big-endian systems. The simplest solution we can find is to revert back to the original definition of the OMPI process_name_t, acknowledging that our original thinking was correct and that we made a mistake.

So rather than this change being something new, it actually is a reversion of a portion of a past commit that has proven to create a problem within the code base. I think we all understand that you have a desire to use the OMPI code base in some external project where an abstract opal_identifier_t would be helpful. What I would suggest is that you adopt the strategy previously employed in such circumstances, including when I've worked on external projects: add the abstraction on your side of the code.

In other words, instead of us continuing to fight the abstraction problem in the OMPI code base where it doesn't really help us, the abstraction should be dealt with in the external project. You would need to write your own wrapper functions to convert back/forth from the opal_process_name_t struct, but that isn't an overly burdensome requirement for re-use of the code base.

Note that nothing in OMPI cares about the actual values inside the jobid/vpid fields of the process_name_t - you are free to assign any value you like to them. We just need that identifier to be structured so we can properly ensure alignment everywhere it is used, and to ensure hetero operations are fully supported in a more easily maintainable way.

HTH
Ralph

@mellanox-github

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
http://bgate.mellanox.com/jenkins/job/gh-ompi-master-pr/4/

Build Log
last 50 lines

[...truncated 15985 lines...]
make[3]: Leaving directory `/scrap/jenkins/jenkins/jobs/gh-ompi-master-pr/workspace-3/test/datatype'
make[2]: Leaving directory `/scrap/jenkins/jenkins/jobs/gh-ompi-master-pr/workspace-3/test/datatype'
Making install in util
make[2]: Entering directory `/scrap/jenkins/jenkins/jobs/gh-ompi-master-pr/workspace-3/test/util'
make[3]: Entering directory `/scrap/jenkins/jenkins/jobs/gh-ompi-master-pr/workspace-3/test/util'
make[3]: Nothing to be done for `install-exec-am'.
make[3]: Nothing to be done for `install-data-am'.
make[3]: Leaving directory `/scrap/jenkins/jenkins/jobs/gh-ompi-master-pr/workspace-3/test/util'
make[2]: Leaving directory `/scrap/jenkins/jenkins/jobs/gh-ompi-master-pr/workspace-3/test/util'
make[2]: Entering directory `/scrap/jenkins/jenkins/jobs/gh-ompi-master-pr/workspace-3/test'
make[3]: Entering directory `/scrap/jenkins/jenkins/jobs/gh-ompi-master-pr/workspace-3/test'
make[3]: Nothing to be done for `install-exec-am'.
make[3]: Nothing to be done for `install-data-am'.
make[3]: Leaving directory `/scrap/jenkins/jenkins/jobs/gh-ompi-master-pr/workspace-3/test'
make[2]: Leaving directory `/scrap/jenkins/jenkins/jobs/gh-ompi-master-pr/workspace-3/test'
make[1]: Leaving directory `/scrap/jenkins/jenkins/jobs/gh-ompi-master-pr/workspace-3/test'
make[1]: Entering directory `/scrap/jenkins/jenkins/jobs/gh-ompi-master-pr/workspace-3'
make[2]: Entering directory `/scrap/jenkins/jenkins/jobs/gh-ompi-master-pr/workspace-3'
make[2]: Nothing to be done for `install-exec-am'.
make[2]: Nothing to be done for `install-data-am'.
make[2]: Leaving directory `/scrap/jenkins/jenkins/jobs/gh-ompi-master-pr/workspace-3'
make[1]: Leaving directory `/scrap/jenkins/jenkins/jobs/gh-ompi-master-pr/workspace-3'
+ '[' -x /usr/bin/dpkg-buildpackage ']'
+ '[' -n yes ']'
+ '[' yes = yes ']'
+ '[' -f /var/lib/jenkins/jobs/gh-ompi-master-pr/workspace-3/contrib/check-help-strings.pl ']'
+ echo 'Checking help strings'
Checking help strings
++ echo oshmem ompi/mca/mtl/mxm ompi/mca/coll/fca ompi/mca/coll/hcoll
+ for dir in '$(echo $help_txt_list)'
+ '[' -d oshmem ']'
+ cd oshmem
+ /var/lib/jenkins/jobs/gh-ompi-master-pr/workspace-3/contrib/check-help-strings.pl .
Searching for source and help files...
Indexing help files (from entire source tree)...
Searching source files (under oshmem)...
Checking for stale help messages / files...
*** WARNING: Possibly unused help topic
  Help file: ./mca/sshmem/mmap/help-oshmem-sshmem-mmap.txt
  Help topic: mmap:file open failure
*** WARNING: Possibly unused help topic
  Help file: ./mca/sshmem/mmap/help-oshmem-sshmem-mmap.txt
  Help topic: mmap:file truncate failure
Total number of warnings: 2
Total number of errors: 0
Build step 'Execute shell' marked build as failure
[BFA] Scanning build for known causes...

[BFA] Done. 0s

Test FAILed.

@jsquyres
Member

jsquyres commented Nov 6, 2014

Any way to have these not fail if help messages unrelated to the PR are causing issues with the detection script (particularly when the script seems to only be checking oshmem...)?

Also, the "details" link next to "Build finished" goes to a 404 on bgate.mellanox.com.

@mike-dubman
Member

yep, will fix it.

@mike-dubman
Member

ok,

  1. The "404" error is due to the fact that jenkins took the PR from the release repo (my mistake) and generated the URL as if it belonged to the "master" branch. (fixed)
  2. Will disable the "help.txt" check only if requested.
  3. shmem tests are failing on trunk; I think this is still unsupported due to the PMIx changes. @elenash, @rhc54 - please confirm.
  4. will re-trigger now

@jsquyres
Member

jsquyres commented Nov 8, 2014

@ggouaillardet Looks like this PR has now become stale (can't be merged automatically). :-(

Can you refresh?

@rhc54
Contributor

rhc54 commented Nov 9, 2014

I've been working on the PMIx code release some more, and on the next phase of memory footprint reductions. In doing that, I believe I'm converging on a solution that may be more palatable to George while still resolving the hetero and SPARC issues.

However, it will still be "extremely invasive" as there is no way to resolve these things without causing disruption somewhere. George's approach is causing significant problems in the RTE-OPAL interface. The currently proposed solution removes those, but shifts the disruption to external re-users of OPAL such as George.

The revised approach would also be disruptive as it will cause significant changes to ORTE, and will cause at least some changes in OPAL. However, it will leave us with a non-structured "identifier", although perhaps only uint32_t instead of uint64_t to conserve memory.

So the question becomes:

  • do we leave hetero and SPARC support broken pending the alternative solution I'm looking at? Note that this won't be done for at least a month, and perhaps more like the end of the year. However, it would involve only one disruption.
  • do we fix that support now, and then revise the fix again later? If so, then we probably should use the current patch (once updated) as the immediate fix and then revisit it later. However, this creates two disruptive events in the code.

Neither of these fixes is going to backport to the 1.8 series, so it is purely a question of setting up for the 1.9 branch. Both paths would be in time for that event.

@ggouaillardet
Contributor Author

@jsquyres will do
@miked-mellanox the link to jenkins is broken, any hint on what is failing?
(make check? make tests? something else?)

@mike-dubman
Member

@ggouaillardet - yep, it failed when I misconfigured jenkins three days ago; that has since been fixed.
I started this PR manually: http://bgate.mellanox.com:8888/jenkins/job/gh-ompi-master-pr/16/console

@mike-dubman
Member

It seems the openib BTL is not working with this commit; this command fails:

19:58:59 + timeout -s SIGKILL 3m mpirun -np 8 -bind-to core -mca btl_openib_if_include mlx4_0:1 -x MXM_RDMA_PORTS=mlx4_0:1 -mca pml ob1 -mca btl self,openib /var/lib/jenkins/jobs/gh-ompi-master-pr/workspace/ompi_install1/examples/hello_c

@rhc54
Contributor

rhc54 commented Nov 9, 2014

Okay - well that is pretty useless since most of us don't have IB. How did it fail? Can you fix it?

@mike-dubman
Member

The same command line on the "master" branch works as expected.
I tried to apply https://github.com/open-mpi/ompi/pull/261.patch onto a clean master, but it fails with conflicts.
Can you rebase this PR on top of the latest master and check if it gets better?

@rhc54
Contributor

rhc54 commented Nov 9, 2014

We could certainly check /* if only you would tell us what the heck is wrong */

@ggouaillardet
Contributor Author

I'll do the rebase now and then test helloworld on an IB cluster.
In the meantime, any hint is welcome!

@ggouaillardet ggouaillardet force-pushed the topic/opal_process_name branch from 55f0bd3 to 804e1d4 on November 10, 2014 03:07
@ggouaillardet
Contributor Author

@miked-mellanox I rebased, did the required porting, and updated my branch (and hence this PR).
Could you please give it a try?
I was able to run the same test on an IB cluster with the same command line.
BTW, do I need something special (hardware? software?) to make MXM effective?

@mellanox-github

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
http://bgate.mellanox.com/jenkins/job/gh-ompi-master-pr/17/

Build Log
last 50 lines

[...truncated 6719 lines...]
  CC       mpool_sm_component.lo
  CCLD     mca_mpool_sm.la
make[3]: Entering directory `/scrap/jenkins/jenkins/jobs/gh-ompi-master-pr/workspace/opal/mca/mpool/sm'
make[3]: Nothing to be done for `install-exec-am'.
 /bin/mkdir -p '/var/lib/jenkins/jobs/gh-ompi-master-pr/workspace/ompi_install1/lib/openmpi'
 /bin/sh ../../../../libtool   --mode=install /usr/bin/install -c   mca_mpool_sm.la '/var/lib/jenkins/jobs/gh-ompi-master-pr/workspace/ompi_install1/lib/openmpi'
libtool: install: warning: relinking `mca_mpool_sm.la'
libtool: install: (cd /scrap/jenkins/jenkins/jobs/gh-ompi-master-pr/workspace/opal/mca/mpool/sm; /bin/sh /scrap/jenkins/jenkins/jobs/gh-ompi-master-pr/workspace/libtool  --silent --tag CC --mode=relink gcc -std=gnu99 -DNDEBUG -O3 -g -finline-functions -fno-strict-aliasing -pthread -module -avoid-version -export-dynamic -o mca_mpool_sm.la -rpath /var/lib/jenkins/jobs/gh-ompi-master-pr/workspace/ompi_install1/lib/openmpi mpool_sm_module.lo mpool_sm_component.lo /scrap/jenkins/jenkins/jobs/gh-ompi-master-pr/workspace/opal/mca/common/sm/libmca_common_sm.la -lrt -lm -lutil -lm -lutil )
libtool: install: /usr/bin/install -c .libs/mca_mpool_sm.soT /var/lib/jenkins/jobs/gh-ompi-master-pr/workspace/ompi_install1/lib/openmpi/mca_mpool_sm.so
libtool: install: /usr/bin/install -c .libs/mca_mpool_sm.lai /var/lib/jenkins/jobs/gh-ompi-master-pr/workspace/ompi_install1/lib/openmpi/mca_mpool_sm.la
libtool: finish: PATH="/hpc/local/bin::/usr/local/bin:/bin:/usr/bin:/usr/sbin:/hpc/local/bin:/hpc/local/bin/:/hpc/local/bin/:/sbin:/usr/sbin:/bin:/usr/bin:/usr/local/sbin:/opt/ibutils/bin:/sbin" ldconfig -n /var/lib/jenkins/jobs/gh-ompi-master-pr/workspace/ompi_install1/lib/openmpi
----------------------------------------------------------------------
Libraries have been installed in:
   /var/lib/jenkins/jobs/gh-ompi-master-pr/workspace/ompi_install1/lib/openmpi

If you ever happen to want to link against installed libraries
in a given directory, LIBDIR, you must either use libtool, and
specify the full pathname of the library, or use the `-LLIBDIR'
flag during linking and do at least one of the following:
   - add LIBDIR to the `LD_LIBRARY_PATH' environment variable
     during execution
   - add LIBDIR to the `LD_RUN_PATH' environment variable
     during linking
   - use the `-Wl,-rpath -Wl,LIBDIR' linker flag
   - have your system administrator add LIBDIR to `/etc/ld.so.conf'

See any operating system documentation about shared libraries for
more information, such as the ld(1) and ld.so(8) manual pages.
----------------------------------------------------------------------
make[3]: Leaving directory `/scrap/jenkins/jenkins/jobs/gh-ompi-master-pr/workspace/opal/mca/mpool/sm'
make[2]: Leaving directory `/scrap/jenkins/jenkins/jobs/gh-ompi-master-pr/workspace/opal/mca/mpool/sm'
Making install in mca/pmix/s2
make[2]: Entering directory `/scrap/jenkins/jenkins/jobs/gh-ompi-master-pr/workspace/opal/mca/pmix/s2'
  CC       mca_pmix_s2_la-pmix_s2_component.lo
  CC       mca_pmix_s2_la-pmix_s2.lo
  CC       mca_pmix_s2_la-pmi2_pmap_parser.lo
pmix_s2.c: In function 's2_fence':
pmix_s2.c:473: error: incompatible type for argument 2 of 'opal_dstore.store'
pmix_s2.c:473: note: expected 'const struct opal_process_name_t *' but argument is of type 'opal_process_name_t'
make[2]: *** [mca_pmix_s2_la-pmix_s2.lo] Error 1
make[2]: *** Waiting for unfinished jobs....
make[2]: Leaving directory `/scrap/jenkins/jenkins/jobs/gh-ompi-master-pr/workspace/opal/mca/pmix/s2'
make[1]: *** [install-recursive] Error 1
make[1]: Leaving directory `/scrap/jenkins/jenkins/jobs/gh-ompi-master-pr/workspace/opal'
make: *** [install-recursive] Error 1
Build step 'Execute shell' marked build as failure
[BFA] Scanning build for known causes...

[BFA] Done. 0s

Test FAILed.

@ggouaillardet ggouaillardet force-pushed the topic/opal_process_name branch from 804e1d4 to 991d618 on November 10, 2014 04:26
@ggouaillardet
Contributor Author

Just fixed the remaining error in pmix/s2, rebased, and updated my branch.
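
(For reference, the compile error above is the usual by-value vs. by-pointer mismatch once the dstore interface takes a pointer to the structured name; the fix is simply to pass the address. A schematic, self-contained example with made-up names, not the actual pmix_s2.c code:)

    #include <stdint.h>
    #include <stdio.h>

    typedef struct { uint32_t jobid; uint32_t vpid; } opal_process_name_t;

    /* stand-in for a dstore-style store() that now takes a pointer to the name */
    static int store(const opal_process_name_t *proc)
    {
        return printf("storing modex data for [%u,%u]\n",
                      (unsigned)proc->jobid, (unsigned)proc->vpid);
    }

    int main(void)
    {
        opal_process_name_t name = { .jobid = 1, .vpid = 3 };
        /* store(name);     old by-value call: "incompatible type for argument" */
        store(&name);    /* fix: pass the address */
        return 0;
    }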

@mellanox-github

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
http://bgate.mellanox.com/jenkins/job/gh-ompi-master-pr/18/

Build Log
last 50 lines

[...truncated 18695 lines...]
++ ibstat -l
+ for hca_dev in '$(ibstat -l)'
+ local hca=mlx4_0:1
+ local mca=
+ mca=' --bind-to core -x SHMEM_SYMMETRIC_HEAP_SIZE=1024M'
+ mca=' --bind-to core -x SHMEM_SYMMETRIC_HEAP_SIZE=1024M --mca btl_openib_if_include mlx4_0:1 -x MXM_RDMA_PORTS=mlx4_0:1'
+ mca=' --bind-to core -x SHMEM_SYMMETRIC_HEAP_SIZE=1024M --mca btl_openib_if_include mlx4_0:1 -x MXM_RDMA_PORTS=mlx4_0:1 --mca rmaps_base_dist_hca mlx4_0:1 --mca sshmem_verbs_hca_name mlx4_0:1'
+ '[' -f /var/lib/jenkins/jobs/gh-ompi-master-pr/workspace/ompi_install1/examples/hello_oshmem ']'
+ echo 'Running /var/lib/jenkins/jobs/gh-ompi-master-pr/workspace/ompi_install1/examples/hello_oshmem '
Running /var/lib/jenkins/jobs/gh-ompi-master-pr/workspace/ompi_install1/examples/hello_oshmem 
+ timeout -s SIGKILL 3m oshrun -np 8 --bind-to core -x SHMEM_SYMMETRIC_HEAP_SIZE=1024M --mca btl_openib_if_include mlx4_0:1 -x MXM_RDMA_PORTS=mlx4_0:1 --mca rmaps_base_dist_hca mlx4_0:1 --mca sshmem_verbs_hca_name mlx4_0:1 -mca pml ob1 -mca btl self,tcp /var/lib/jenkins/jobs/gh-ompi-master-pr/workspace/ompi_install1/examples/hello_oshmem
WARNING: The mechanism by which environment variables are explicitly
passed to Open MPI has changed!

Specifically, beginning in the 1.9.x/2.0.x series, using "-x" to set
environment variables is deprecated.  Please use the
"mca_base_env_list" MCA parameter, instead.

For example, this invocation using the old "-x" mechanism:

    mpirun -x env_foo1=bar1 -x env_foo2=bar2 -x env_foo3 ...

is equivalent to this invocation using the new "mca_base_env_list"
mechanism:

    mpirun -mca mca_base_env_list 'env_foo1=bar1;env_foo2=bar2;env_foo3' ...
/var/lib/jenkins/jobs/gh-ompi-master-pr/workspace/ompi_install1/examples/hello_oshmem: symbol lookup error: /var/lib/jenkins/jobs/gh-ompi-master-pr/workspace/ompi_install1/lib/openmpi/mca_spml_ikrit.so: undefined symbol: opal_process_name_vpid
/var/lib/jenkins/jobs/gh-ompi-master-pr/workspace/ompi_install1/examples/hello_oshmem: symbol lookup error: /var/lib/jenkins/jobs/gh-ompi-master-pr/workspace/ompi_install1/lib/openmpi/mca_spml_ikrit.so: undefined symbol: opal_process_name_vpid
/var/lib/jenkins/jobs/gh-ompi-master-pr/workspace/ompi_install1/examples/hello_oshmem: symbol lookup error: /var/lib/jenkins/jobs/gh-ompi-master-pr/workspace/ompi_install1/lib/openmpi/mca_spml_ikrit.so: undefined symbol: opal_process_name_vpid
/var/lib/jenkins/jobs/gh-ompi-master-pr/workspace/ompi_install1/examples/hello_oshmem: symbol lookup error: /var/lib/jenkins/jobs/gh-ompi-master-pr/workspace/ompi_install1/lib/openmpi/mca_spml_ikrit.so: undefined symbol: opal_process_name_vpid
/var/lib/jenkins/jobs/gh-ompi-master-pr/workspace/ompi_install1/examples/hello_oshmem: symbol lookup error: /var/lib/jenkins/jobs/gh-ompi-master-pr/workspace/ompi_install1/lib/openmpi/mca_spml_ikrit.so: undefined symbol: opal_process_name_vpid
/var/lib/jenkins/jobs/gh-ompi-master-pr/workspace/ompi_install1/examples/hello_oshmem: symbol lookup error: /var/lib/jenkins/jobs/gh-ompi-master-pr/workspace/ompi_install1/lib/openmpi/mca_spml_ikrit.so: undefined symbol: opal_process_name_vpid
/var/lib/jenkins/jobs/gh-ompi-master-pr/workspace/ompi_install1/examples/hello_oshmem: symbol lookup error: /var/lib/jenkins/jobs/gh-ompi-master-pr/workspace/ompi_install1/lib/openmpi/mca_spml_ikrit.so: undefined symbol: opal_process_name_vpid
/var/lib/jenkins/jobs/gh-ompi-master-pr/workspace/ompi_install1/examples/hello_oshmem: symbol lookup error: /var/lib/jenkins/jobs/gh-ompi-master-pr/workspace/ompi_install1/lib/openmpi/mca_spml_ikrit.so: undefined symbol: opal_process_name_vpid
-------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code.. Per user-direction, the job has been aborted.
-------------------------------------------------------
--------------------------------------------------------------------------
oshrun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[46589,1],5]
  Exit code:    127
--------------------------------------------------------------------------
Build step 'Execute shell' marked build as failure
[BFA] Scanning build for known causes...

[BFA] Done. 0s

Test FAILed.

@mike-dubman
Member

It now passes the MPI part (the previous failure also passes now), but fails on the OSHMEM API (oshrun helloworld).

@ggouaillardet
Contributor Author

@miked-mellanox
the error is in oshmem/mca/spml/ikrit
it cannot be built on my system since I do not have mxm_api.h
is this Mellanox proprietary software? Can it be downloaded (for free)?

I will have a look and make a "blind" port ...

@ggouaillardet ggouaillardet force-pushed the topic/opal_process_name branch from 991d618 to 58b3e41 on November 10, 2014 04:52
@ggouaillardet
Contributor Author

"blind" port committed ...

@ggouaillardet ggouaillardet force-pushed the topic/opal_process_name branch from 58b3e41 to 668d152 on November 10, 2014 04:58
@mellanox-github

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
http://bgate.mellanox.com/jenkins/job/gh-ompi-master-pr/19/

Build Log
last 50 lines

[...truncated 18841 lines...]
--------------------------------------------------------------------------
It looks like SHMEM_INIT failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during SHMEM_INIT; some of which are due to configuration or environment
problems.  This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open SHMEM
developer):

  mca_memheap_base_select() failed
  --> Returned "Error" (-1) instead of "Success" (0)
--------------------------------------------------------------------------
[jenkins01:02207] Error: pshmem_init.c:61 - shmem_init() SHMEM failed to initialize - aborting
[jenkins01:02218] Error: pshmem_init.c:61 - shmem_init() SHMEM failed to initialize - aborting
--------------------------------------------------------------------------
SHMEM_ABORT was invoked on rank 2 (pid 2207, host=jenkins01) with errorcode -1.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
A SHMEM process is aborting at a time when it cannot guarantee that all
of its peer processes in the job will be killed properly.  You should
double check that everything has shut down cleanly.

Local host: jenkins01
PID:        2207
--------------------------------------------------------------------------
[jenkins01:02214] Error: pshmem_init.c:61 - shmem_init() SHMEM failed to initialize - aborting
[jenkins01:02206] Error: pshmem_init.c:61 - shmem_init() SHMEM failed to initialize - aborting
[jenkins01:02209] Error: pshmem_init.c:61 - shmem_init() SHMEM failed to initialize - aborting
[jenkins01:02205] Error: pshmem_init.c:61 - shmem_init() SHMEM failed to initialize - aborting
[jenkins01:02216] Error: pshmem_init.c:61 - shmem_init() SHMEM failed to initialize - aborting
[jenkins01:02212] Error: pshmem_init.c:61 - shmem_init() SHMEM failed to initialize - aborting
-------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code.. Per user-direction, the job has been aborted.
-------------------------------------------------------
--------------------------------------------------------------------------
oshrun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[42617,1],7]
  Exit code:    255
--------------------------------------------------------------------------
[jenkins01:02203] 7 more processes have sent help message help-shmem-runtime.txt / shmem_init:startup:internal-failure
[jenkins01:02203] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
[jenkins01:02203] 7 more processes have sent help message help-shmem-api.txt / shmem-abort
[jenkins01:02203] 7 more processes have sent help message help-shmem-runtime.txt / oshmem shmem abort:cannot guarantee all killed
Build step 'Execute shell' marked build as failure
[BFA] Scanning build for known causes...

[BFA] Done. 0s

Test FAILed.

@ggouaillardet
Contributor Author

@miked-mellanox @elenash I just fixed a bug when vader is used without knem
(e.g. OMPI_MCA_btl_vader_single_copy_mechanism=none)

then I pushed the fix to the master, rebased my branch, and "asked" jenkins to test again.
BTW, will jenkins really test again, or does it "obey" only folks at Mellanox?

@elenash
Contributor

elenash commented Nov 10, 2014

@ggouaillardet I just ran it on master. Should I check it on your fork? Is this issue in any way related to the issue in oshmem? As I understand it, this issue has been in master since last week.

@ggouaillardet
Contributor Author

@elenash is the root cause a very small /tmp filesystem?
If so, can you try again with -mca shmem_mmap_relocate_backing_file=1
/* use /dev/shm instead of /tmp */
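
For example (adapting the hello_oshmem invocation from the log above; the path and process count are just placeholders):

    oshrun -np 8 -mca shmem_mmap_relocate_backing_file 1 ./examples/hello_oshmem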

@ggouaillardet
Contributor Author

@elenash the thing is, @miked-mellanox and I observe different behaviors ...
it works for me with --mca btl tcp,self but fails for him,
whereas it works for him with --mca btl vader,self but used to fail for me (I just pushed a fix to the master; the problem started for me four weeks ago).
I would like you to test the master first, and only if that is successful, test my branch.
Thanks in advance!

@mike-dubman
Member

@ggouaillardet - jenkins is testing again at your command (see the last box in this thread with the jenkins details):
http://bgate.mellanox.com:8888/jenkins/job/gh-ompi-master-pr/22/

@mellanox-github

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
http://bgate.mellanox.com/jenkins/job/gh-ompi-master-pr/22/

Build Log
last 50 lines

[...truncated 18685 lines...]
[jenkins01:16334] Error: pshmem_init.c:61 - shmem_init() SHMEM failed to initialize - aborting
--------------------------------------------------------------------------
It looks like SHMEM_INIT failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during SHMEM_INIT; some of which are due to configuration or environment
problems.  This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open SHMEM
developer):

  mca_memheap_base_select() failed
  --> Returned "Error" (-1) instead of "Success" (0)
--------------------------------------------------------------------------
--------------------------------------------------------------------------
SHMEM_ABORT was invoked on rank 0 (pid 16334, host=jenkins01) with errorcode -1.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
A SHMEM process is aborting at a time when it cannot guarantee that all
of its peer processes in the job will be killed properly.  You should
double check that everything has shut down cleanly.

Local host: jenkins01
PID:        16334
--------------------------------------------------------------------------
[jenkins01:16335] Error: pshmem_init.c:61 - shmem_init() SHMEM failed to initialize - aborting
[jenkins01:16346] Error: pshmem_init.c:61 - shmem_init() SHMEM failed to initialize - aborting
[jenkins01:16338] Error: pshmem_init.c:61 - shmem_init() SHMEM failed to initialize - aborting
[jenkins01:16340] Error: pshmem_init.c:61 - shmem_init() SHMEM failed to initialize - aborting
[jenkins01:16343] Error: pshmem_init.c:61 - shmem_init() SHMEM failed to initialize - aborting
[jenkins01:16336] Error: pshmem_init.c:61 - shmem_init() SHMEM failed to initialize - aborting
[jenkins01:16344] Error: pshmem_init.c:61 - shmem_init() SHMEM failed to initialize - aborting
-------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code.. Per user-direction, the job has been aborted.
-------------------------------------------------------
--------------------------------------------------------------------------
oshrun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[37166,1],0]
  Exit code:    255
--------------------------------------------------------------------------
[jenkins01:16332] 7 more processes have sent help message help-shmem-runtime.txt / shmem_init:startup:internal-failure
[jenkins01:16332] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
[jenkins01:16332] 7 more processes have sent help message help-shmem-api.txt / shmem-abort
[jenkins01:16332] 7 more processes have sent help message help-shmem-runtime.txt / oshmem shmem abort:cannot guarantee all killed
Build step 'Execute shell' marked build as failure
[BFA] Scanning build for known causes...

[BFA] Done. 0s

Test FAILed.

@mike-dubman
Member

@alex-mikheev, @elenash - could you please comment on the sshmem memheap failure reason?

@elenash
Contributor

elenash commented Nov 10, 2014

@ggouaillardet Do you test it with mxm?

@ggouaillardet
Contributor Author

@elenash no (but I'll give it a try shortly)

@elenash
Contributor

elenash commented Nov 10, 2014

@ggouaillardet master is working after your fix, thanks!
Your branch is not. Maybe I didn't take all of your changes? I just did git clone https://github.com/ggouaillardet/ompi.git. Oh, I see, that was your master branch. I'll check out the topic branch and try again :)

@elenash
Contributor

elenash commented Nov 10, 2014

@ggouaillardet Something strange: I still reproduce the issue with the vader btl on your branch.

@ggouaillardet
Contributor Author

@elenash on my side, I get a hang in MPI_Allgatherv with --mca btl tcp,self on the master.
As a workaround, I export:
OMPI_MCA_coll_tuned_use_dynamic_rules=1
OMPI_MCA_coll_tuned_allgatherv_algorithm=3

I'll write a simple MPI (i.e. non-oshmem) reproducer, roughly like the sketch below, and investigate
what is going wrong.
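
A minimal Allgatherv reproducer along these lines (a sketch only; the counts and the final test may well differ):

    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    /* each rank contributes a different number of ints so the displacements are
     * non-trivial, which is where the tuned allgatherv algorithms differ */
    int main(int argc, char **argv)
    {
        int rank, size;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        int sendcount = rank + 1;
        int *sendbuf = malloc(sendcount * sizeof(int));
        for (int i = 0; i < sendcount; i++) sendbuf[i] = rank;

        int *recvcounts = malloc(size * sizeof(int));
        int *displs = malloc(size * sizeof(int));
        int total = 0;
        for (int i = 0; i < size; i++) {
            recvcounts[i] = i + 1;
            displs[i] = total;
            total += recvcounts[i];
        }
        int *recvbuf = malloc(total * sizeof(int));

        MPI_Allgatherv(sendbuf, sendcount, MPI_INT,
                       recvbuf, recvcounts, displs, MPI_INT, MPI_COMM_WORLD);

        if (0 == rank) printf("allgatherv done, %d ints gathered\n", total);

        free(sendbuf); free(recvbuf); free(recvcounts); free(displs);
        MPI_Finalize();
        return 0;
    }

run with something like: mpirun -np 8 --mca btl tcp,self ./allgatherv_repro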

@alex-mikheev
Contributor

The reason is that on master the sshmem verbs memory allocator turns on 'shared_mr' on Connect-IB.
It later fails because this feature is not supported.

I pushed the fix 097b469 to master.

@mike-dubman
Member

@ggouaillardet - could you please rebase on top of Alex's fix and let jenkins do the rest :)

@ggouaillardet ggouaillardet force-pushed the topic/opal_process_name branch from 668d152 to 84d94a7 on November 11, 2014 03:14
@ggouaillardet
Contributor Author

retest this please

@mellanox-github

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
http://bgate.mellanox.com/jenkins/job/gh-ompi-master-pr/23/

Build Log
last 50 lines

[...truncated 6722 lines...]
  CC       mpool_sm_component.lo
  CCLD     mca_mpool_sm.la
make[3]: Entering directory `/scrap/jenkins/jenkins/jobs/gh-ompi-master-pr/workspace/opal/mca/mpool/sm'
make[3]: Nothing to be done for `install-exec-am'.
 /bin/mkdir -p '/var/lib/jenkins/jobs/gh-ompi-master-pr/workspace/ompi_install1/lib/openmpi'
 /bin/sh ../../../../libtool   --mode=install /usr/bin/install -c   mca_mpool_sm.la '/var/lib/jenkins/jobs/gh-ompi-master-pr/workspace/ompi_install1/lib/openmpi'
libtool: install: warning: relinking `mca_mpool_sm.la'
libtool: install: (cd /scrap/jenkins/jenkins/jobs/gh-ompi-master-pr/workspace/opal/mca/mpool/sm; /bin/sh /scrap/jenkins/jenkins/jobs/gh-ompi-master-pr/workspace/libtool  --silent --tag CC --mode=relink gcc -std=gnu99 -DNDEBUG -O3 -g -finline-functions -fno-strict-aliasing -pthread -module -avoid-version -export-dynamic -o mca_mpool_sm.la -rpath /var/lib/jenkins/jobs/gh-ompi-master-pr/workspace/ompi_install1/lib/openmpi mpool_sm_module.lo mpool_sm_component.lo /scrap/jenkins/jenkins/jobs/gh-ompi-master-pr/workspace/opal/mca/common/sm/libmca_common_sm.la -lrt -lm -lutil -lm -lutil )
libtool: install: /usr/bin/install -c .libs/mca_mpool_sm.soT /var/lib/jenkins/jobs/gh-ompi-master-pr/workspace/ompi_install1/lib/openmpi/mca_mpool_sm.so
libtool: install: /usr/bin/install -c .libs/mca_mpool_sm.lai /var/lib/jenkins/jobs/gh-ompi-master-pr/workspace/ompi_install1/lib/openmpi/mca_mpool_sm.la
libtool: finish: PATH="/hpc/local/bin::/usr/local/bin:/bin:/usr/bin:/usr/sbin:/hpc/local/bin:/hpc/local/bin/:/hpc/local/bin/:/sbin:/usr/sbin:/bin:/usr/bin:/usr/local/sbin:/opt/ibutils/bin:/sbin" ldconfig -n /var/lib/jenkins/jobs/gh-ompi-master-pr/workspace/ompi_install1/lib/openmpi
----------------------------------------------------------------------
Libraries have been installed in:
   /var/lib/jenkins/jobs/gh-ompi-master-pr/workspace/ompi_install1/lib/openmpi

If you ever happen to want to link against installed libraries
in a given directory, LIBDIR, you must either use libtool, and
specify the full pathname of the library, or use the `-LLIBDIR'
flag during linking and do at least one of the following:
   - add LIBDIR to the `LD_LIBRARY_PATH' environment variable
     during execution
   - add LIBDIR to the `LD_RUN_PATH' environment variable
     during linking
   - use the `-Wl,-rpath -Wl,LIBDIR' linker flag
   - have your system administrator add LIBDIR to `/etc/ld.so.conf'

See any operating system documentation about shared libraries for
more information, such as the ld(1) and ld.so(8) manual pages.
----------------------------------------------------------------------
make[3]: Leaving directory `/scrap/jenkins/jenkins/jobs/gh-ompi-master-pr/workspace/opal/mca/mpool/sm'
make[2]: Leaving directory `/scrap/jenkins/jenkins/jobs/gh-ompi-master-pr/workspace/opal/mca/mpool/sm'
Making install in mca/pmix/s2
make[2]: Entering directory `/scrap/jenkins/jenkins/jobs/gh-ompi-master-pr/workspace/opal/mca/pmix/s2'
  CC       mca_pmix_s2_la-pmix_s2_component.lo
  CC       mca_pmix_s2_la-pmix_s2.lo
  CC       mca_pmix_s2_la-pmi2_pmap_parser.lo
pmix_s2.c: In function 's2_fence':
pmix_s2.c:473: error: incompatible type for argument 2 of 'opal_dstore.store'
pmix_s2.c:473: note: expected 'const struct opal_process_name_t *' but argument is of type 'opal_process_name_t'
make[2]: *** [mca_pmix_s2_la-pmix_s2.lo] Error 1
make[2]: *** Waiting for unfinished jobs....
make[2]: Leaving directory `/scrap/jenkins/jenkins/jobs/gh-ompi-master-pr/workspace/opal/mca/pmix/s2'
make[1]: *** [install-recursive] Error 1
make[1]: Leaving directory `/scrap/jenkins/jenkins/jobs/gh-ompi-master-pr/workspace/opal'
make: *** [install-recursive] Error 1
Build step 'Execute shell' marked build as failure
[BFA] Scanning build for known causes...

[BFA] Done. 0s

Test FAILed.

@ggouaillardet ggouaillardet force-pushed the topic/opal_process_name branch from 84d94a7 to 668d152 on November 11, 2014 03:41
* opal_process_name_t is now a struct :
    typedef uint32_t opal_jobid_t;
    typedef uint32_t opal_vpid_t;
    typedef struct {
        opal_jobid_t jobid;
        opal_vpid_t vpid;
    } opal_process_name_t;

* new opal_proc_table_t class :
  this is a hash table (key is jobid) of hash tables (key is vpid)
  and is used to store per opal_process_name_t info.

* new OPAL_NAME dss type

This commit is co-authored by Ralph and Gilles.
@ggouaillardet ggouaillardet force-pushed the topic/opal_process_name branch from 668d152 to ddb72d1 on November 11, 2014 03:43
@ggouaillardet
Contributor Author

retest this please

@mellanox-github

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
http://bgate.mellanox.com/jenkins/job/gh-ompi-master-pr/24/

Build Log
last 50 lines

[...truncated 23083 lines...]
WARNING: The mechanism by which environment variables are explicitly
passed to Open MPI has changed!

Specifically, beginning in the 1.9.x/2.0.x series, using "-x" to set
environment variables is deprecated.  Please use the
"mca_base_env_list" MCA parameter, instead.

For example, this invocation using the old "-x" mechanism:

    mpirun -x env_foo1=bar1 -x env_foo2=bar2 -x env_foo3 ...

is equivalent to this invocation using the new "mca_base_env_list"
mechanism:

    mpirun -mca mca_base_env_list 'env_foo1=bar1;env_foo2=bar2;env_foo3' ...
6/8 dst = 7 8 9
2/8 dst = 7 8 9
4/8 dst = 7 8 9
7/8 dst = 7 8 9
0/8 dst = 7 8 9
5/8 dst = 7 8 9
3/8 dst = 7 8 9
1/8 dst = 7 8 9
+ '[' yes == yes ']'
+ timeout -s SIGKILL 3m oshrun -np 8 --bind-to core -x SHMEM_SYMMETRIC_HEAP_SIZE=1024M --mca spml yoda -mca pml ob1 -mca btl self,vader /var/lib/jenkins/jobs/gh-ompi-master-pr/workspace/ompi_install1/examples/oshmem_max_reduction
WARNING: The mechanism by which environment variables are explicitly
passed to Open MPI has changed!

Specifically, beginning in the 1.9.x/2.0.x series, using "-x" to set
environment variables is deprecated.  Please use the
"mca_base_env_list" MCA parameter, instead.

For example, this invocation using the old "-x" mechanism:

    mpirun -x env_foo1=bar1 -x env_foo2=bar2 -x env_foo3 ...

is equivalent to this invocation using the new "mca_base_env_list"
mechanism:

    mpirun -mca mca_base_env_list 'env_foo1=bar1;env_foo2=bar2;env_foo3' ...
--------------------------------------------------------------------------
SHMEM_ABORT was invoked on rank 2 (pid 28074, host=jenkins01) with errorcode -1.
--------------------------------------------------------------------------
[jenkins01:27998] 7 more processes have sent help message help-shmem-api.txt / shmem-abort
[jenkins01:27998] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
Build step 'Execute shell' marked build as failure
[BFA] Scanning build for known causes...

[BFA] Done. 0s

Test FAILed.

@mellanox-github

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
http://bgate.mellanox.com/jenkins/job/gh-ompi-master-pr/25/

Build Log
last 50 lines

[...truncated 23083 lines...]
WARNING: The mechanism by which environment variables are explicitly
passed to Open MPI has changed!

Specifically, beginning in the 1.9.x/2.0.x series, using "-x" to set
environment variables is deprecated.  Please use the
"mca_base_env_list" MCA parameter, instead.

For example, this invocation using the old "-x" mechanism:

    mpirun -x env_foo1=bar1 -x env_foo2=bar2 -x env_foo3 ...

is equivalent to this invocation using the new "mca_base_env_list"
mechanism:

    mpirun -mca mca_base_env_list 'env_foo1=bar1;env_foo2=bar2;env_foo3' ...
7/8 dst = 7 8 9
1/8 dst = 7 8 9
5/8 dst = 7 8 9
0/8 dst = 7 8 9
6/8 dst = 7 8 9
4/8 dst = 7 8 9
2/8 dst = 7 8 9
3/8 dst = 7 8 9
+ '[' yes == yes ']'
+ timeout -s SIGKILL 3m oshrun -np 8 --bind-to core -x SHMEM_SYMMETRIC_HEAP_SIZE=1024M --mca spml yoda -mca pml ob1 -mca btl self,vader /var/lib/jenkins/jobs/gh-ompi-master-pr/workspace-2/ompi_install1/examples/oshmem_max_reduction
WARNING: The mechanism by which environment variables are explicitly
passed to Open MPI has changed!

Specifically, beginning in the 1.9.x/2.0.x series, using "-x" to set
environment variables is deprecated.  Please use the
"mca_base_env_list" MCA parameter, instead.

For example, this invocation using the old "-x" mechanism:

    mpirun -x env_foo1=bar1 -x env_foo2=bar2 -x env_foo3 ...

is equivalent to this invocation using the new "mca_base_env_list"
mechanism:

    mpirun -mca mca_base_env_list 'env_foo1=bar1;env_foo2=bar2;env_foo3' ...
--------------------------------------------------------------------------
SHMEM_ABORT was invoked on rank 5 (pid 8131, host=jenkins01) with errorcode -1.
--------------------------------------------------------------------------
[jenkins01:08120] 7 more processes have sent help message help-shmem-api.txt / shmem-abort
[jenkins01:08120] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
Build step 'Execute shell' marked build as failure
[BFA] Scanning build for known causes...

[BFA] Done. 0s

Test FAILed.

@ggouaillardet
Contributor Author

retest this please

@mellanox-github

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
http://bgate.mellanox.com/jenkins/job/gh-ompi-master-pr/32/
Test PASSed.

@rhc54
Contributor

rhc54 commented Nov 12, 2014

Committed by Ralph per telecon discussion on 11/11/2014.

@rhc54 rhc54 closed this Nov 12, 2014
jsquyres pushed a commit to jsquyres/ompi that referenced this pull request Sep 21, 2016
Some cleanup of warnings when building optimized
dong0321 pushed a commit to dong0321/ompi that referenced this pull request Feb 17, 2020
Fix the hardcoded path to singularity and add code documentation