orted crash with v4.0.5 and BUILTIN_GCC (GCC v7.3.1) #8268

Open
rajachan opened this issue Dec 4, 2020 · 1 comment

Comments

rajachan commented Dec 4, 2020

Thank you for taking the time to submit an issue!

Background information

What version of Open MPI are you using? (e.g., v3.0.5, v4.0.2, git branch name and hash, etc.)

v4.0.5

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

RPM built from the internal spec file.

If you are building/installing from a git clone, please copy-n-paste the output from git submodule status.

N/A

Please describe the system on which you are running

  • Operating system/version: Amazon Linux 2 (Fedora-based)
  • Computer hardware: AWS Graviton2 instance (with Arm Neoverse-N1 cores)
  • Network type: EFA (so, OFI MTL)

Details of the problem

We are seeing occasional orted segfaults on this platform when running with a default configuration:

==== starting mpirun --prefix /opt/amazon/openmpi --wdir results/omb/collective/osu_allgatherv -n 2048 -N 64 --tag-output  --hostfile /fsx/hfile -x PATH -x LD_LIBRARY_PATH /fsx/dkothar/SubspaceBenchmarks/spack/opt/spack/linux-amzn2-aarch64/gcc-7.3.1/osu-micro-benchmarks-5.6-xmuoliterjpnfcnhn2wpapdpdisfrmrx/libexec/osu-micro-benchmarks/mpi/collective/osu_allgatherv -x 10 -i 10 : Mon Nov 30 14:52:04 UTC 2020 ====
[ip-172-31-15-226:13802] *** Process received signal ***
[ip-172-31-15-226:13802] Signal: Segmentation fault (11)
[ip-172-31-15-226:13802] Signal code: Address not mapped (1)
[ip-172-31-15-226:13802] Failing at address: (nil)
[ip-172-31-15-226:13802] [ 0] linux-vdso.so.1(__kernel_rt_sigreturn+0x0)[0x4000202a5668]
[ip-172-31-15-226:13802] [ 1] /opt/amazon/openmpi/lib64/libopen-rte.so.40(orte_state_base_activate_proc_state+0xcc)[0x40002034f710]
[ip-172-31-15-226:13802] [ 2] /opt/amazon/openmpi/lib64/libopen-rte.so.40(orte_odls_base_spawn_proc+0x4fc)[0x40002032397c]
[ip-172-31-15-226:13802] [ 3] /opt/amazon/openmpi/lib64/libopen-pal.so.40(opal_libevent2022_event_base_loop+0xdb0)[0x40002041eed0]
[ip-172-31-15-226:13802] [ 4] /opt/amazon/openmpi/lib64/libopen-pal.so.40(+0x3e038)[0x4000203da038]
[ip-172-31-15-226:13802] [ 5] /lib64/libpthread.so.0(+0x72ac)[0x4000206562ac]
[ip-172-31-15-226:13802] [ 6] /lib64/libc.so.6(+0xd5e9c)[0x400020759e9c]
[ip-172-31-15-226:13802] *** End of error message ***
bash: line 1: 13802 Segmentation fault      (core dumped) /opt/amazon/openmpi/bin/orted -mca ess "env" -mca ess_base_jobid "2449276928" -mca ess_base_vpid 31 -mca ess_base_num_procs "32" -mca orte_node_regex "ip-[3:172]-31-4-217,[3:172].31.14.206,[3:172].31.10.112,[3:172].31.1.198,[3:172].31.8.200,[3:172].31.14.151,[3:172].31.7.206,[3:172].31.7.136,[3:172].31.2.252,[3:172].31.9.88,[3:172].31.13.44,[3:172].31.1.14,[3:172].31.9.249,[3:172].31.0.146,[3:172].31.3.111,[3:172].31.4.58,[3:172].31.12.94,[3:172].31.4.81,[3:172].31.4.249,[3:172].31.6.39,[3:172].31.7.103,[3:172].31.11.148,[3:172].31.0.23,[3:172].31.0.165,[3:172].31.5.196,[3:172].31.10.50,[3:172].31.11.232,[3:172].31.2.153,[3:172].31.3.106,[3:172].31.10.135,[3:172].31.14.82,[1:3].235.16.187@0(32)" -mca orte_hnp_uri "2449276928.0;tcp://172.31.4.217:37933" -mca plm "rsh" --tree-spawn -mca routed "radix" -mca orte_parent_uri "2449276928.0;tcp://172.31.4.217:37933" -mca rmaps_ppr_n_pernode "64" -mca orte_tag_output "1" -mca pmix "^s1,s2,cray,isolated"
--------------------------------------------------------------------------
ORTE has lost communication with a remote daemon.

  HNP daemon   : [[37373,0],0] on node ip-172-31-4-217
  Remote daemon: [[37373,0],31] on node 3.235.16.187

This is usually due to either a failure of the TCP network
connection to the node, or possibly an internal failure of
the daemon itself. We cannot recover from this failure, and
therefore will terminate the job.
--------------------------------------------------------------------------
mpirun: Forwarding signal 18 to job
mpirun: Forwarding signal 18 to job
mpirun: Forwarding signal 18 to job
mpirun: Forwarding signal 18 to job
mpirun: Forwarding signal 18 to job
return status: 205

Looks like cls_constructor_array is NULL inside the opal thread; opal_cls_initialize() should have initialized it, and that initialization is protected by an atomic lock. Suspecting the atomic implementation on this platform, we disabled the builtins (--disable-builtin-atomics) and the issue no longer reproduces. By default, the 4.0.x branch uses BUILTIN_GCC atomics (from GCC v7.3.1 in this case); with the builtins disabled, we fall back to the arm64-specific assembly for the atomic ops. The issue also does not occur on distros that ship a newer GCC (> v9, e.g. Ubuntu 20).
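
For context, here is a minimal, hypothetical C11 sketch of the kind of pattern we suspect is misbehaving: a lazily built constructor array whose one-time setup is guarded by an atomic lock and an "initialized" flag. The names (lazy_class_t, lazy_class_initialize) and the memory orderings are illustrative only, not the actual OPAL code:

```c
/* Hypothetical sketch, NOT the actual OPAL source: a constructor array that
 * is built lazily, with the one-time initialization guarded by an atomic
 * lock and an "initialized" flag. */
#include <stdatomic.h>
#include <stdlib.h>

typedef void (*construct_fn_t)(void *);

typedef struct {
    atomic_int      initialized;        /* 0 until the array is built */
    atomic_flag     lock;               /* atomic lock guarding the build */
    construct_fn_t *constructor_array;  /* filled in on first use */
} lazy_class_t;

static void lazy_class_initialize(lazy_class_t *cls)
{
    /* Fast path: somebody already built the array. */
    if (atomic_load_explicit(&cls->initialized, memory_order_acquire)) {
        return;
    }

    /* Take the atomic lock so only one thread performs the build. */
    while (atomic_flag_test_and_set_explicit(&cls->lock, memory_order_acquire)) {
        /* spin */
    }

    if (!atomic_load_explicit(&cls->initialized, memory_order_relaxed)) {
        cls->constructor_array = calloc(4, sizeof(construct_fn_t));

        /* The flag must not become visible before the array pointer does.
         * If the atomics are broken on this platform, a caller on another
         * core could take the fast path above and then dereference a NULL
         * constructor_array -- the same symptom as the orted segfault. */
        atomic_store_explicit(&cls->initialized, 1, memory_order_release);
    }

    atomic_flag_clear_explicit(&cls->lock, memory_order_release);
}
```

If the builtin atomics do not provide the expected ordering or visibility on these cores, a caller could observe the flag as set while the array pointer is still NULL, which would match the NULL dereference in the backtrace above; rebuilding with --disable-builtin-atomics sidesteps that implementation entirely.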

This failure has been non-deterministic and hard to reproduce consistently. Do we have any known issues in this code path? I've gone through the open issues related to atomics and Arm and could not find anything in particular that might be causing this. Wanted to put feelers out while we continue to debug.

cc: @hjelmn

rhc54 commented Dec 4, 2020

I'm unaware of any problems down in there, but that doesn't mean something couldn't exist. The issue in the opal thread sounds very suspicious, however, especially if it goes away with different atomics.
