
Conversation

@hjelmn (Member) commented Jun 20, 2016

Missing commits from patcher and request rework code.

Fixes open-mpi/ompi#1794
Fixes open-mpi/ompi#1795

:bot:label:bug
:bot:milestone:v2.0.0
:bot:assign: @jsquyres

hjelmn and others added 4 commits June 20, 2016 15:19
The opal_mem_hooks_release_hook does not have const on the pointer
(though it probably should). This commit eliminates a warning by
casting away the const until opal_mem_hooks_release_hook is updated to
use const.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>

(cherry picked from open-mpi/ompi@5612998)

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
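
A minimal sketch of the cast described in this commit message, assuming opal_mem_hooks_release_hook() takes a non-const void * (its stated current prototype); the intercepting caller below is hypothetical and only illustrates the pattern of casting away const until the hook is updated:

```c
#include <stddef.h>
#include <stdbool.h>

/* assumed prototype: non-const pointer, hence the warning when a
 * const pointer is passed without a cast */
void opal_mem_hooks_release_hook (void *buf, size_t length, bool from_alloc);

/* hypothetical intercept function, for illustration only */
void example_intercept_munmap (const void *start, size_t length)
{
    /* cast away const until opal_mem_hooks_release_hook uses const */
    opal_mem_hooks_release_hook ((void *) start, length, false);
}
```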
This commit moves the patcher framework initialization to the
memory/patcher component.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>

(cherry picked from open-mpi/ompi@41f00b7)

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
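
A hedged sketch of what moving the patcher framework initialization into the memory/patcher component might look like; the component function name, header paths, and framework handle follow OPAL's MCA conventions but are assumptions here, not taken from the patch itself:

```c
#include "opal/mca/base/mca_base_framework.h"  /* assumed header path */
#include "opal/mca/patcher/base/base.h"        /* assumed header path */

/* hypothetical component open function for memory/patcher */
static int memory_patcher_component_open (void)
{
    /* open the patcher framework from this component rather than from
     * generic OPAL startup code, as the commit message describes */
    return mca_base_framework_open (&opal_patcher_base_framework, 0);
}
```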
Thanks to Paul Hargrove for reporting.

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>

(cherry picked from open-mpi/ompi@acbd2c6)

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
Fix warnings introduced by request rework.

Signed-off-by: Nathan Hjelm <hjelmn@me.com>
(cherry picked from open-mpi/ompi@b001184)

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
@ompiteam-bot added this to the v2.0.0 milestone Jun 20, 2016
@jsquyres (Member)

This PR fixes open-mpi/ompi#1794 and open-mpi/ompi#1795.

@mellanox-github

Test FAILed.
See http://bgate.mellanox.com/jenkins/job/gh-ompi-release-pr/1793/ for details.

@jsquyres (Member)

@hjelmn Looks like this caused a legit segv at the Mellanox jenkins:

18:03:32 + taskset -c 6,7 timeout -s SIGSEGV 10m /var/lib/jenkins/jobs/gh-ompi-release-pr/workspace/ompi_install1/bin/mpirun -np 4 -bind-to core -mca btl_openib_if_include mlx4_0:1 -x MXM_RDMA_PORTS=mlx4_0:1 -x UCX_NET_DEVICES=mlx4_0:1 -x UCX_TLS=rc,cm -mca pml yalla /var/lib/jenkins/jobs/gh-ompi-release-pr/workspace/ompi_install1/thread_tests/thread-tests-1.1/overlap 8
18:03:32 Time per iteration on each process (ms)
18:03:32 Time    Compute time    Comm time
18:03:34 [jenkins01:12443:0] Caught signal 11 (Segmentation fault)
18:03:34 --------------------------------------------------------------------------
18:03:34 mpirun noticed that process rank 1 with PID 0 on node jenkins01 exited on signal 11 (Segmentation fault).
18:03:34 --------------------------------------------------------------------------

@jladd-mlnx @Di0gen Could we get a corefile backtrace, perchance? Thanks!

@mellanox-github

Test PASSed.
See http://bgate.mellanox.com/jenkins/job/gh-ompi-release-pr/1795/ for details.

@mellanox-github

Test FAILed.
See http://bgate.mellanox.com/jenkins/job/gh-ompi-release-pr/1796/ for details.

@jsquyres (Member)

@jladd-mlnx Both this PR and #1238 are failing with this stack trace in the thread-tests-1.1/overlap test:

11:54:44 + taskset -c 0,1 timeout -s SIGSEGV 10m /var/lib/jenkins/jobs/gh-ompi-release-pr/workspace-2/ompi_install1/bin/mpirun -np 4 -bind-to core -mca btl_openib_receive_queues P,65536,256,192,128:S,128,256,192,128:S,2048,1024,1008,64:S,12288,1024,1008,64:S,65536,1024,1008,64 -mca btl_openib_cpc_include rdmacm -mca pml '^ucx' -mca btl self,openib -mca btl_if_include mlx4_0:2 /var/lib/jenkins/jobs/gh-ompi-release-pr/workspace-2/ompi_install1/thread_tests/thread-tests-1.1/overlap 8
11:54:45 Time per iteration on each process (ms)
11:54:45 Time    Compute time    Comm time
11:54:47 [jenkins01:10915:0] Caught signal 11 (Segmentation fault)
11:54:47 ==== backtrace ====
11:54:47  2 0x000000000005a42c mxm_handle_error()  /hpc/local/benchmarks/hpc-stack-gcc-Monday/src/install/mxm-master/src/mxm/util/debug/debug.c:641
11:54:47  3 0x000000000005a59c mxm_error_signal_handler()  /hpc/local/benchmarks/hpc-stack-gcc-Monday/src/install/mxm-master/src/mxm/util/debug/debug.c:616
11:54:47  4 0x0000003d690329a0 killpg()  ??:0
11:54:47  5 0x0000000000040589 hmca_coll_ml_barrier_launch()  coll_ml_barrier.c:0
11:54:47  6 0x00000000000410a9 hmca_coll_ml_barrier_intra()  ??:0
11:54:47  7 0x0000000000008a2a mca_coll_hcoll_barrier()  /var/lib/jenkins/jobs/gh-ompi-release-pr/workspace-2/ompi/mca/coll/hcoll/coll_hcoll_ops.c:28
11:54:47  8 0x0000000000078ca6 PMPI_Barrier()  /var/lib/jenkins/jobs/gh-ompi-release-pr/workspace-2/ompi/mpi/c/profile/pbarrier.c:61
11:54:47  9 0x0000000000401009 main()  ??:0
11:54:47 10 0x0000003d6901ed1d __libc_start_main()  ??:0
11:54:47 11 0x0000000000400b69 _start()  ??:0
11:54:47 ===================

But this stack trace implies that there's nothing going on with requests (i.e., this possibly isn't related to the request rework). Also, #1238 is a one-sided thing, not a request thing -- but it is also failing in MPI_Barrier.

Was there a change in the hcoll stack on the jenkins machine recently, perchance? (it looks like MXM in the stack might be a red herring -- looks like it's just the segv handler, because it was an openib BTL run...?)

@jladd-mlnx (Member)

@jsquyres @hjelmn Looks like a race condition to me. I ran it five times with no issues, and on the sixth it segfaulted. This is with Yalla. Now it's hanging. I can give access if you'd like.

Time     Compute time    Comm time
[jenkins01:11916:0] Caught signal 11 (Segmentation fault)
==== backtrace ====
 2 0x000000000005a42c mxm_handle_error()  /var/tmp/OFED_topdir/BUILD/mxm-3.5.3092/src/mxm/util/debug/debug.c:641
 3 0x000000000005a59c mxm_error_signal_handler()  /var/tmp/OFED_topdir/BUILD/mxm-3.5.3092/src/mxm/util/debug/debug.c:616
 4 0x0000003d690329a0 killpg()  ??:0
 5 0x0000000000003637 opal_datatype_is_contiguous_memory_layout()  /var/lib/jenkins/jobs/gh-ompi-release-pr/workspace-2/ompi/mca/pml/yalla/../../../../opal/datatype/opal_datatype.h:217
 6 0x0000000000005760 mca_pml_yalla_irecv()  /var/lib/jenkins/jobs/gh-ompi-release-pr/workspace-2/ompi/mca/pml/yalla/pml_yalla.c:348
 7 0x000000000009bf4d PMPI_Irecv()  /var/lib/jenkins/jobs/gh-ompi-release-pr/workspace-2/ompi/mpi/c/profile/pirecv.c:78
 8 0x0000000000400ef0 main()  ??:0
 9 0x0000003d6901ed1d __libc_start_main()  ??:0
10 0x0000000000400b69 _start()  ??:0
===================
[jenkins01:11907] 3 more processes have sent help message help-mpi-btl-openib.txt / default subnet prefix
[jenkins01:11907] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
--------------------------------------------------------------------------
mpirun noticed that process rank 3 with PID 0 on node jenkins01 exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------

@jsquyres (Member)

@jladd-mlnx Can you bisect?

@jladd-mlnx (Member)

@artpol84 Could you please have Boris take a look? Just ssh to jenkins01 on Bgate and copy-paste the command line.

@hjelmn (Member, Author) commented Jun 23, 2016

@jsquyres This one probably should go in as well. Want to get an MTT run in tonight.

@jsquyres (Member)

@hppritcha I'm ok with this one going in, too.

@mellanox-github

Test FAILed.
See http://bgate.mellanox.com/jenkins/job/gh-ompi-release-pr/1816/ for details.

@mellanox-github

Test FAILed.
See http://bgate.mellanox.com/jenkins/job/gh-ompi-release-pr/1819/ for details.
