hwloc 2.0 segfault in ec2/cfncluster on master #4027

@PeterGottesman

Background information

This was encountered on an EC2 cluster, running inside a Slurm (16.05.3) allocation.

What version of Open MPI are you using? (e.g., v1.10.3, v2.1.0, git branch name and hash, etc.)

This occurred on master (e79eb85).

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

Open MPI was installed from a git clone, then tarballed and distributed to all other nodes in the cluster.

Please describe the system on which you are running

  • Operating system/version: Amazon Linux
  • Computer hardware: EC2 C4.8Xlarge
  • Network type:

Details of the problem

Please describe, in detail, the problem that you are having, including the behavior you expect to see, the actual behavior that you are seeing, steps to reproduce the problem, etc. It is most helpful if you can attach a small program that a developer can use to reproduce your problem.

shell$ salloc -N 10
shell$ mpirun hostname
--------------------------------------------------------------------------
An ORTE daemon has unexpectedly failed after launch and before
communicating back to mpirun. This could be caused by a number
of factors, including an inability to create a connection back
to mpirun due to a lack of common network interfaces and/or no
route found between them. Please check network connectivity
(including firewalls and network routing requirements).
--------------------------------------------------------------------------

I have a core file for each orted. The GDB backtrace from one of them is below:

#0  0x00007f89fbdd11d1 in hwloc__duplicate_object (newtopology=newtopology@entry=0x11d1280, newparent=newparent@entry=0x11d1690, newobj=0x11e7530, newobj@entry=0x0,
    src=src@entry=0x1112900) at ../../../../../../../opal/mca/hwloc/hwloc2a/hwloc/hwloc/topology.c:715
#1  0x00007f89fbdd12fb in hwloc__duplicate_object (newtopology=newtopology@entry=0x11d1280, newparent=newparent@entry=0x0, newobj=0x11d1690, src=src@entry=0x10fb2e0)
    at ../../../../../../../opal/mca/hwloc/hwloc2a/hwloc/hwloc/topology.c:735
#2  0x00007f89fbdd34a6 in opal_hwloc2a_hwloc__topology_dup (newp=newp@entry=0x7fff618ecb10, old=old@entry=0x10fae60, tma=tma@entry=0x7fff618ecb20)
    at ../../../../../../../opal/mca/hwloc/hwloc2a/hwloc/hwloc/topology.c:833
#3  0x00007f89fbdaf65c in opal_hwloc2a_hwloc_shmem_topology_get_length (topology=0x10fae60, lengthp=lengthp@entry=0x7f89f93e7fe8 <shmemsize>, flags=flags@entry=0)
    at ../../../../../../../opal/mca/hwloc/hwloc2a/hwloc/hwloc/shmem.c:68
#4  0x00007f89f91e646c in init () at ../../../../../orte/mca/rtc/hwloc/rtc_hwloc.c:104
#5  0x00007f89fc0a120b in orte_rtc_base_select () at ../../../../orte/mca/rtc/base/rtc_base_select.c:74
#6  0x00007f89fc0785d2 in orte_ess_base_orted_setup () at ../../../../orte/mca/ess/base/ess_base_std_orted.c:510
#7  0x00007f89f9dff115 in rte_init () at ../../../../../orte/mca/ess/slurm/ess_slurm_module.c:80
#8  0x00007f89fc0364d9 in orte_init (pargc=0x11e7760, pargc@entry=0x7fff618ede9c, pargv=0x40, pargv@entry=0x7fff618ede90, flags=0, flags@entry=2)
    at ../../orte/runtime/orte_init.c:273
#9  0x00007f89fc0596bd in orte_daemon (argc=argc@entry=19, argv=argv@entry=0x7fff618ee0f8) at ../../orte/orted/orted_main.c:350
#10 0x000000000040078a in main (argc=19, argv=0x7fff618ee0f8) at ../../../../orte/tools/orted/orted.c:60
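
The call chain is easier to follow with the argument lists stripped out. Below is a minimal sketch in Python that reduces GDB `bt` output to frame number, function, and source location. The helper name `parse_frames` and the regular expression are illustrative assumptions, not part of GDB or Open MPI, and the pattern only covers frames that carry an `at file:line` suffix, as the ones above do.

```python
import re

# Matches one GDB frame after wrapped lines have been joined, e.g.
#   "#4  0x00007f89f91e646c in init () at ../rtc_hwloc.c:104"
# The address is optional because GDB omits it for some frames.
FRAME_RE = re.compile(
    r"^#(?P<num>\d+)\s+(?:0x[0-9a-fA-F]+\s+in\s+)?(?P<func>\S+)\s+"
    r"\(.*\)\s+at\s+(?P<file>\S+):(?P<line>\d+)$"
)


def parse_frames(backtrace):
    """Return (frame number, function, source basename, line) tuples."""
    # GDB wraps long frames; fold continuation lines into their frame.
    joined = []
    for raw in backtrace.splitlines():
        text = raw.strip()
        if not text:
            continue
        if text.startswith("#"):
            joined.append(text)
        elif joined:
            joined[-1] += " " + text

    frames = []
    for entry in joined:
        match = FRAME_RE.match(entry)
        if match:  # frames without source info are skipped
            frames.append((
                int(match["num"]),
                match["func"],
                match["file"].rsplit("/", 1)[-1],
                int(match["line"]),
            ))
    return frames


# Two frames copied from the backtrace above.
sample = """\
#3  0x00007f89fbdaf65c in opal_hwloc2a_hwloc_shmem_topology_get_length (topology=0x10fae60, lengthp=lengthp@entry=0x7f89f93e7fe8 <shmemsize>, flags=flags@entry=0)
    at ../../../../../../../opal/mca/hwloc/hwloc2a/hwloc/hwloc/shmem.c:68
#4  0x00007f89f91e646c in init () at ../../../../../orte/mca/rtc/hwloc/rtc_hwloc.c:104
"""

for num, func, src, line in parse_frames(sample):
    print(f"#{num} {func} ({src}:{line})")
```

Run over the full backtrace, this shows the path from `init` in `rtc_hwloc.c` through `opal_hwloc2a_hwloc_shmem_topology_get_length` and `opal_hwloc2a_hwloc__topology_dup` into the recursive `hwloc__duplicate_object` call where the segfault occurs.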
