Regression test suite with MPI causes nonblocking IO write errors on macOS #47

felker · 2017-12-05T20:15:49Z

This is a subtle problem that I have noticed on my MacBook Pro and when using Travis CI with osx VM builds over the past few months. I have just now diagnosed the issue, but it would be good if others can reproduce this and discuss a workaround (if full macOS support is desired).

Bug report

Bug summary

The make() command in tst/regression/scripts/utils/athena.py, invoked by a regression test module's prepare() step fails in the linker with

make: write error: stdout

being written in the middle of the stdout stream.

The challenge in tracking down this bug is that it manifests in any regression test only after an MPI-enabled regression test (which itself passes). Currently, the only regression test scripts that use MPI are:

mpi/mpi_linwave.py
grav/jeans_3d.py

Code for reproduction

cd tst/regression
python run_tests.py mpi/mpi_linwave hydro/sod_shock

Or, another example:

cd tst/regression
python run_tests.py grav/jeans_3d outputs/all_outputs

The error only occurs in the make commands invoked in the separate test scripts following the MPI regression test. In other words, even though the mpi_linwave.py first compiles an MPI binary and then a serial binary in the same module, there is no write error in that test.

Furthermore, the bug only occurs either when:

Both tests are run in the same Python command, as above
The multiple python run_tests.py testname commands are executed in the same process/script.

So, command line execution of

cd tst/regression
python run_tests.py grav/jeans_3d
python outputs/all_outputs

works fine, but running those commands in a Bash script will produce the error. Hence, in VM environments like Travis CI, which wraps the user's commands in a command called travis_run_script, this issue may appear.

Actual outcome

...

g++  -O3 -o /Users/kfelker/Desktop/athena-trunk-clean/tst/regression/bin/athena /Users/kfelker/Desktop/athena-trunk-clean/tst/regression/obj/main.o /Users/kfelker/Desktop/athena-trunk-clean/tst/regression/obj/globals.o /Users/kfelker/Desktop/athena-trunk-clean/tst/regression/obj/parameter_input.o /Users/kfelker/Desktop/athena-trunk-clean/tst/regression/obj/get_boundary_flag.o /Users/kfelker/Desktop/athena-trunk-clean/tst/regression/obj/reflect.o /Users/kfelker/Desktop/athena-trunk-clean/tst/regression/obj/bvals_mg.o /Users/kfelker/Desktop/athena-trunk-clean/tst/regression/obj/bvals_cc.o /Users/kfelker/Desktop/athena-trunk-clean/tst/regression/obj/polarwedge.o /Users/kfelker/Desktop/athena-trunk-clean/tst/regression/obj/bvals_base.o /Users/kfelker/Desktop/athena-trunk-clean/tst/regression/obj/flux_correction_fc.o /Users/kfelker/Desktop/athena-trunk-clean/tst/regression/obj/bvals_grav.o /Users/kfelker/Desktop/athena-trunk-clean/tst/regression/obj/flux_correction_cc.o /Users/kfelker/Desktop/athena-trunk-clean/tsmake: write error: stdout
Traceback (most recent call last):
  File "run_tests.py", line 76, in main
    module.prepare()
  File "/Users/kfelker/Desktop/athena-trunk-clean/tst/regression/scripts/tests/hydro/hydro_linwave.py", line 21, in prepare
    athena.make()
  File "/Users/kfelker/Desktop/athena-trunk-clean/tst/regression/scripts/utils/athena.py", line 46, in make
    .format(err.returncode,' '.join(err.cmd)))
AthenaError: Return code 1 from command 'make -j EXE_DIR:=/Users/kfelker/Desktop/athena-trunk-clean/tst/regression/bin/ OBJ_DIR:=/Users/kfelker/Desktop/athena-trunk-clean/tst/regression/obj/'
---> Error in scripts/tests/hydro/hydro_linwave.py

Results:
    mpi.mpi_linwave: passed
    hydro.sod_shock: failed -- unexpected failure in prepare() stage
    hydro.hydro_linwave: failed -- unexpected failure in prepare() stage

Summary: 1 out of 3 tests passed

I have also observed:

IOError: [Errno 35] Resource temporarily unavailable

instead of make: write error when using alternative subprocess commands; see below.

Version Information

My current MacBook Pro environment is:

Operating System: macOS Sierra 10.12.6
Python Version: Python 2.7.13, installed by Homebrew
C++ compiler version: macOS system clang

Apple LLVM version 9.0.0 (clang-900.0.38)
Target: x86_64-apple-darwin16.7.0
Thread model: posix
InstalledDir: /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin

MPI version: MPICH 3.2.0, installed by Homebrew. mpicxx -show:

clang++ -Wl,-flat_namespace -Wl,-commons,use_dylibs -I/usr/local/Cellar/mpich/3.2_3/include -L/usr/local/Cellar/mpich/3.2_3/lib -lmpicxx -lmpi -lpmpi

corresponding to one of these Homebrew-built binaries (bottles):

...
commit d414b50c6744a47b1cbfa72f716bf8b39720684d
Author: BrewTestBot <brew-test-bot@googlegroups.com>
Date:   Mon Sep 18 18:37:15 2017 +0000

    mpich: update 3.2_3 bottle.

commit 54c585fbd1d02b4271a5e154e1cda4458944cfb0
Author: BrewTestBot <brew-test-bot@googlegroups.com>
Date:   Thu May 4 02:36:36 2017 +0000

    mpich: update 3.2_3 bottle.
...

It the process of debugging, I have tried many different versions of compilers, Python environments, build options, and macOS versions. It also occurs with:

gcc 7.1 or 4.9 installed by Homebrew
System-managed or user Homebrew-managed Python 2.7 installations. Also tried:
- virtualenv for Python.
- Starting Python in unbuffered binary stdout and stderr mode, via python -u
OpenMPI or MPICH installed by Homebrew or source, compiled with gcc or clang
GNU Make versions 4.2.1, 3.8.1, either macOS system or Homebrew managed
- Serial or parallel Make
- Tried --output-sync option for GNU Make version > 4.0 to ensure that parallel make output is buffered and well-ordered.
Various releases of macOS 10.10, 10.11, 10.12

Explanation

I have encountered related bug reports in the Travis CI, GNU Make, and macOS communities, but never found a complete explanation until recently. The 12/1/17 reply on travis-ci/travis-ci#4704 explains that this bug is caused by the EAGAIN signal, "try again/ data not ready" from nonblocking socket.

To query if O_NONBLOCK is set, you can use:

python -c 'import os,sys,fcntl; flags = fcntl.fcntl(sys.stdout, fcntl.F_GETFL); print(flags&os.O_NONBLOCK);'

Embedding this command in the regression test driver returns 0 (O_NONBLOCK disabled) after every test until an MPI-enabled test executes, then it returns a nonzero number. So, the open question is: why do the MPI-enabled regression test scripts turn on nonblocking IO, and why does it occur only after the overall completion of the test?

Possible fixes

Disable nonblocking IO after all regression tests, or after an MPI-enabled regression test. I am currently placing the following command in my driver script:

python -c 'import os,sys,fcntl; flags = fcntl.fcntl(sys.stdout, fcntl.F_GETFL); fcntl.fcntl(sys.stdout, fcntl.F_SETFL, flags&~os.O_NONBLOCK);'

Redirect the athena.py output of make() command to a file by replacing the subprocess.check_call() commands with subprocess.Popen() and pipes. Or, suppress the output by replacing with subprocess.check_output() (without ever communicating the stream to stdout).
Figure out how to prevent the MPI-enabled regression test from switching to nonblocking IO in the first place.

The text was updated successfully, but these errors were encountered:

felker · 2017-12-07T17:07:40Z

Update: it turns out that is a known MPICH bug on macOS pmodels/mpich#1782

It was apparently fixed in the MPICH repository about a month ago: pmodels/mpich#2755

Now, I am unable to reproduce the bug with OpenMPI on macOS. I don't think any modification needs to be made to the Athena++ regression test suite at this time. It is just something for macOS users to be aware of when running it with MPICH version < 3.2.1 which was released on 11/10/2017. Presently, Homebrew users need to install MPICH with the --HEAD flag to get the patched 3.2.1. The --devel flag results in the year-old https://github.com/pmodels/mpich/releases/tag/v3.3a2

- Not compatible with Python 3, see #169. Follow PEP 394 and reserve "python" for scripts that are compatible with both Py2 and Py3. - Remove artificial C++ style error in field.hpp - This is the first change from upstream cpplint.py google/styleguide@1b206ee Calling "python -u" in cpplint_athena.sh has fixed the previously jumbled stdout and stderr from cpplint.py calls to sys.stdout.write() and sys.stderr.write() in the Jenkins log. However, unbuffered writes may have been an issue with regression test crashes in macOS and on Travis CI (see #47), but it was most likely a red herring due to the known O_NONBLOCK bug in MPICH's mpirun(). Monitor this. See if Jenkins log now contains pure output from git ls-tree command.

felker added the bug Broken functionality or unexpected result label Dec 5, 2017

felker closed this as completed Dec 7, 2017

felker mentioned this issue Apr 13, 2018

Add continuous integration with Travis CI and Jenkins #97

Merged

25 tasks

felker added the testing Regression test suite and CI label Apr 13, 2018

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Regression test suite with MPI causes nonblocking IO write errors on macOS #47

Regression test suite with MPI causes nonblocking IO write errors on macOS #47

felker commented Dec 5, 2017 •

edited

Loading

felker commented Dec 7, 2017 •

edited

Loading

Regression test suite with MPI causes nonblocking IO write errors on macOS #47

Regression test suite with MPI causes nonblocking IO write errors on macOS #47

Comments

felker commented Dec 5, 2017 • edited Loading

Bug report

felker commented Dec 7, 2017 • edited Loading

felker commented Dec 5, 2017 •

edited

Loading

felker commented Dec 7, 2017 •

edited

Loading