You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
This is a subtle problem that I have noticed on my MacBook Pro and when using Travis CI with osx VM builds over the past few months. I have just now diagnosed the issue, but it would be good if others can reproduce this and discuss a workaround (if full macOS support is desired).
Bug report
Bug summary
The make() command in tst/regression/scripts/utils/athena.py, invoked by a regression test module's prepare() step fails in the linker with
make: write error: stdout
being written in the middle of the stdout stream.
The challenge in tracking down this bug is that it manifests in any regression test only after an MPI-enabled regression test (which itself passes). Currently, the only regression test scripts that use MPI are:
mpi/mpi_linwave.py
grav/jeans_3d.py
Code for reproduction
cd tst/regression
python run_tests.py mpi/mpi_linwave hydro/sod_shock
Or, another example:
cd tst/regression
python run_tests.py grav/jeans_3d outputs/all_outputs
The error only occurs in the make commands invoked in the separate test scripts following the MPI regression test. In other words, even though the mpi_linwave.py first compiles an MPI binary and then a serial binary in the same module, there is no write error in that test.
Furthermore, the bug only occurs either when:
Both tests are run in the same Python command, as above
The multiple python run_tests.py testname commands are executed in the same process/script.
So, command line execution of
cd tst/regression
python run_tests.py grav/jeans_3d
python outputs/all_outputs
works fine, but running those commands in a Bash script will produce the error. Hence, in VM environments like Travis CI, which wraps the user's commands in a command called travis_run_script, this issue may appear.
Actual outcome
...
g++ -O3 -o /Users/kfelker/Desktop/athena-trunk-clean/tst/regression/bin/athena /Users/kfelker/Desktop/athena-trunk-clean/tst/regression/obj/main.o /Users/kfelker/Desktop/athena-trunk-clean/tst/regression/obj/globals.o /Users/kfelker/Desktop/athena-trunk-clean/tst/regression/obj/parameter_input.o /Users/kfelker/Desktop/athena-trunk-clean/tst/regression/obj/get_boundary_flag.o /Users/kfelker/Desktop/athena-trunk-clean/tst/regression/obj/reflect.o /Users/kfelker/Desktop/athena-trunk-clean/tst/regression/obj/bvals_mg.o /Users/kfelker/Desktop/athena-trunk-clean/tst/regression/obj/bvals_cc.o /Users/kfelker/Desktop/athena-trunk-clean/tst/regression/obj/polarwedge.o /Users/kfelker/Desktop/athena-trunk-clean/tst/regression/obj/bvals_base.o /Users/kfelker/Desktop/athena-trunk-clean/tst/regression/obj/flux_correction_fc.o /Users/kfelker/Desktop/athena-trunk-clean/tst/regression/obj/bvals_grav.o /Users/kfelker/Desktop/athena-trunk-clean/tst/regression/obj/flux_correction_cc.o /Users/kfelker/Desktop/athena-trunk-clean/tsmake: write error: stdout
Traceback (most recent call last):
File "run_tests.py", line 76, in main
module.prepare()
File "/Users/kfelker/Desktop/athena-trunk-clean/tst/regression/scripts/tests/hydro/hydro_linwave.py", line 21, in prepare
athena.make()
File "/Users/kfelker/Desktop/athena-trunk-clean/tst/regression/scripts/utils/athena.py", line 46, in make
.format(err.returncode,' '.join(err.cmd)))
AthenaError: Return code 1 from command 'make -j EXE_DIR:=/Users/kfelker/Desktop/athena-trunk-clean/tst/regression/bin/ OBJ_DIR:=/Users/kfelker/Desktop/athena-trunk-clean/tst/regression/obj/'
---> Error in scripts/tests/hydro/hydro_linwave.py
Results:
mpi.mpi_linwave: passed
hydro.sod_shock: failed -- unexpected failure in prepare() stage
hydro.hydro_linwave: failed -- unexpected failure in prepare() stage
Summary: 1 out of 3 tests passed
It the process of debugging, I have tried many different versions of compilers, Python environments, build options, and macOS versions. It also occurs with:
gcc 7.1 or 4.9 installed by Homebrew
System-managed or user Homebrew-managed Python 2.7 installations. Also tried:
virtualenv for Python.
Starting Python in unbuffered binary stdout and stderr mode, via python -u
OpenMPI or MPICH installed by Homebrew or source, compiled with gcc or clang
GNU Make versions 4.2.1, 3.8.1, either macOS system or Homebrew managed
Serial or parallel Make
Tried --output-sync option for GNU Make version > 4.0 to ensure that parallel make output is buffered and well-ordered.
Various releases of macOS 10.10, 10.11, 10.12
Explanation
I have encountered related bug reports in the Travis CI, GNU Make, and macOS communities, but never found a complete explanation until recently. The 12/1/17 reply on travis-ci/travis-ci#4704 explains that this bug is caused by the EAGAIN signal, "try again/ data not ready" from nonblocking socket.
Embedding this command in the regression test driver returns 0 (O_NONBLOCK disabled) after every test until an MPI-enabled test executes, then it returns a nonzero number. So, the open question is: why do the MPI-enabled regression test scripts turn on nonblocking IO, and why does it occur only after the overall completion of the test?
Possible fixes
Disable nonblocking IO after all regression tests, or after an MPI-enabled regression test. I am currently placing the following command in my driver script:
Redirect the athena.py output of make() command to a file by replacing the subprocess.check_call() commands with subprocess.Popen() and pipes. Or, suppress the output by replacing with subprocess.check_output() (without ever communicating the stream to stdout).
Figure out how to prevent the MPI-enabled regression test from switching to nonblocking IO in the first place.
The text was updated successfully, but these errors were encountered:
Update: it turns out that is a known MPICH bug on macOS pmodels/mpich#1782
It was apparently fixed in the MPICH repository about a month ago: pmodels/mpich#2755
Now, I am unable to reproduce the bug with OpenMPI on macOS. I don't think any modification needs to be made to the Athena++ regression test suite at this time. It is just something for macOS users to be aware of when running it with MPICH version < 3.2.1 which was released on 11/10/2017. Presently, Homebrew users need to install MPICH with the --HEAD flag to get the patched 3.2.1. The --devel flag results in the year-old https://github.com/pmodels/mpich/releases/tag/v3.3a2
- Not compatible with Python 3, see #169. Follow PEP 394 and reserve "python"
for scripts that are compatible with both Py2 and Py3.
- Remove artificial C++ style error in field.hpp
- This is the first change from upstream cpplint.py
google/styleguide@1b206ee
Calling "python -u" in cpplint_athena.sh has fixed the previously
jumbled stdout and stderr from cpplint.py calls to sys.stdout.write()
and sys.stderr.write() in the Jenkins log. However, unbuffered writes
may have been an issue with regression test crashes in macOS and on
Travis CI (see #47), but it was most likely a red herring due to the
known O_NONBLOCK bug in MPICH's mpirun(). Monitor this.
See if Jenkins log now contains pure output from git ls-tree command.
This is a subtle problem that I have noticed on my MacBook Pro and when using Travis CI with
osx
VM builds over the past few months. I have just now diagnosed the issue, but it would be good if others can reproduce this and discuss a workaround (if full macOS support is desired).Bug report
Bug summary
The
make()
command intst/regression/scripts/utils/athena.py
, invoked by a regression test module'sprepare()
step fails in the linker withbeing written in the middle of the stdout stream.
The challenge in tracking down this bug is that it manifests in any regression test only after an MPI-enabled regression test (which itself passes). Currently, the only regression test scripts that use MPI are:
mpi/mpi_linwave.py
grav/jeans_3d.py
Code for reproduction
cd tst/regression python run_tests.py mpi/mpi_linwave hydro/sod_shock
Or, another example:
cd tst/regression python run_tests.py grav/jeans_3d outputs/all_outputs
The error only occurs in the
make
commands invoked in the separate test scripts following the MPI regression test. In other words, even though thempi_linwave.py
first compiles an MPI binary and then a serial binary in the same module, there is no write error in that test.Furthermore, the bug only occurs either when:
python run_tests.py testname
commands are executed in the same process/script.So, command line execution of
cd tst/regression python run_tests.py grav/jeans_3d python outputs/all_outputs
works fine, but running those commands in a Bash script will produce the error. Hence, in VM environments like Travis CI, which wraps the user's commands in a command called
travis_run_script
, this issue may appear.Actual outcome
I have also observed:
instead of
make: write error
when using alternativesubprocess
commands; see below.Version Information
My current MacBook Pro environment is:
clang
mpicxx -show
:corresponding to one of these Homebrew-built binaries (bottles):
It the process of debugging, I have tried many different versions of compilers, Python environments, build options, and macOS versions. It also occurs with:
gcc
7.1 or 4.9 installed by Homebrewvirtualenv
for Python.python -u
gcc
orclang
--output-sync
option for GNU Make version > 4.0 to ensure that parallel make output is buffered and well-ordered.Explanation
I have encountered related bug reports in the Travis CI, GNU Make, and macOS communities, but never found a complete explanation until recently. The 12/1/17 reply on travis-ci/travis-ci#4704 explains that this bug is caused by the EAGAIN signal, "try again/ data not ready" from nonblocking socket.
To query if
O_NONBLOCK
is set, you can use:Embedding this command in the regression test driver returns 0 (
O_NONBLOCK
disabled) after every test until an MPI-enabled test executes, then it returns a nonzero number. So, the open question is: why do the MPI-enabled regression test scripts turn on nonblocking IO, and why does it occur only after the overall completion of the test?Possible fixes
athena.py
output ofmake()
command to a file by replacing thesubprocess.check_call()
commands withsubprocess.Popen()
and pipes. Or, suppress the output by replacing withsubprocess.check_output()
(without ever communicating the stream to stdout).The text was updated successfully, but these errors were encountered: