Regression test suite with MPI causes nonblocking IO write errors on macOS #47

Closed
felker opened this issue Dec 5, 2017 · 1 comment
Labels
bug Broken functionality or unexpected result testing Regression test suite and CI

felker commented Dec 5, 2017

This is a subtle problem that I have noticed on my MacBook Pro and when using Travis CI with osx VM builds over the past few months. I have just now diagnosed the issue, but it would be good if others can reproduce this and discuss a workaround (if full macOS support is desired).

Bug report

Bug summary

The make() command in tst/regression/scripts/utils/athena.py, invoked by a regression test module's prepare() step, fails at the link step with

make: write error: stdout

being written in the middle of the stdout stream.

The challenge in tracking down this bug is that it manifests in any regression test only after an MPI-enabled regression test (which itself passes). Currently, the only regression test scripts that use MPI are:

  • mpi/mpi_linwave.py
  • grav/jeans_3d.py

Code for reproduction

cd tst/regression
python run_tests.py mpi/mpi_linwave hydro/sod_shock

Or, another example:

cd tst/regression
python run_tests.py grav/jeans_3d outputs/all_outputs

The error only occurs in the make commands invoked in the separate test scripts following the MPI regression test. In other words, even though the mpi_linwave.py first compiles an MPI binary and then a serial binary in the same module, there is no write error in that test.

Furthermore, the bug occurs only when either:

  1. Both tests are run in the same Python command, as above
  2. Multiple python run_tests.py testname commands are executed in the same process/script.

So, command line execution of

cd tst/regression
python run_tests.py grav/jeans_3d
python run_tests.py outputs/all_outputs

works fine, but running those commands in a Bash script will produce the error. Hence, in VM environments like Travis CI, which wraps the user's commands in a command called travis_run_script, this issue may appear.

Actual outcome

...

g++  -O3 -o /Users/kfelker/Desktop/athena-trunk-clean/tst/regression/bin/athena /Users/kfelker/Desktop/athena-trunk-clean/tst/regression/obj/main.o /Users/kfelker/Desktop/athena-trunk-clean/tst/regression/obj/globals.o /Users/kfelker/Desktop/athena-trunk-clean/tst/regression/obj/parameter_input.o /Users/kfelker/Desktop/athena-trunk-clean/tst/regression/obj/get_boundary_flag.o /Users/kfelker/Desktop/athena-trunk-clean/tst/regression/obj/reflect.o /Users/kfelker/Desktop/athena-trunk-clean/tst/regression/obj/bvals_mg.o /Users/kfelker/Desktop/athena-trunk-clean/tst/regression/obj/bvals_cc.o /Users/kfelker/Desktop/athena-trunk-clean/tst/regression/obj/polarwedge.o /Users/kfelker/Desktop/athena-trunk-clean/tst/regression/obj/bvals_base.o /Users/kfelker/Desktop/athena-trunk-clean/tst/regression/obj/flux_correction_fc.o /Users/kfelker/Desktop/athena-trunk-clean/tst/regression/obj/bvals_grav.o /Users/kfelker/Desktop/athena-trunk-clean/tst/regression/obj/flux_correction_cc.o /Users/kfelker/Desktop/athena-trunk-clean/tsmake: write error: stdout
Traceback (most recent call last):
  File "run_tests.py", line 76, in main
    module.prepare()
  File "/Users/kfelker/Desktop/athena-trunk-clean/tst/regression/scripts/tests/hydro/hydro_linwave.py", line 21, in prepare
    athena.make()
  File "/Users/kfelker/Desktop/athena-trunk-clean/tst/regression/scripts/utils/athena.py", line 46, in make
    .format(err.returncode,' '.join(err.cmd)))
AthenaError: Return code 1 from command 'make -j EXE_DIR:=/Users/kfelker/Desktop/athena-trunk-clean/tst/regression/bin/ OBJ_DIR:=/Users/kfelker/Desktop/athena-trunk-clean/tst/regression/obj/'
---> Error in scripts/tests/hydro/hydro_linwave.py

Results:
    mpi.mpi_linwave: passed
    hydro.sod_shock: failed -- unexpected failure in prepare() stage
    hydro.hydro_linwave: failed -- unexpected failure in prepare() stage

Summary: 1 out of 3 tests passed

I have also observed:

IOError: [Errno 35] Resource temporarily unavailable

instead of make: write error when using alternative subprocess commands; see below.

Version Information

My current MacBook Pro environment is:

  • Operating System: macOS Sierra 10.12.6
  • Python Version: Python 2.7.13, installed by Homebrew
  • C++ compiler version: macOS system clang
Apple LLVM version 9.0.0 (clang-900.0.38)
Target: x86_64-apple-darwin16.7.0
Thread model: posix
InstalledDir: /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin
  • MPI version: MPICH 3.2.0, installed by Homebrew. mpicxx -show:
clang++ -Wl,-flat_namespace -Wl,-commons,use_dylibs -I/usr/local/Cellar/mpich/3.2_3/include -L/usr/local/Cellar/mpich/3.2_3/lib -lmpicxx -lmpi -lpmpi

corresponding to one of these Homebrew-built binaries (bottles):

...
commit d414b50c6744a47b1cbfa72f716bf8b39720684d
Author: BrewTestBot <brew-test-bot@googlegroups.com>
Date:   Mon Sep 18 18:37:15 2017 +0000

    mpich: update 3.2_3 bottle.

commit 54c585fbd1d02b4271a5e154e1cda4458944cfb0
Author: BrewTestBot <brew-test-bot@googlegroups.com>
Date:   Thu May 4 02:36:36 2017 +0000

    mpich: update 3.2_3 bottle.
...

In the process of debugging, I have tried many different versions of compilers, Python environments, build options, and macOS versions. The bug also occurs with:

  • gcc 7.1 or 4.9 installed by Homebrew
  • System-managed or user Homebrew-managed Python 2.7 installations. Also tried:
    • virtualenv for Python.
    • Starting Python in unbuffered binary stdout and stderr mode, via python -u
  • OpenMPI or MPICH installed by Homebrew or source, compiled with gcc or clang
  • GNU Make versions 4.2.1 and 3.81, either macOS system or Homebrew managed
    • Serial or parallel Make
    • Tried the --output-sync option (GNU Make 4.0 and later) to ensure that parallel make output is buffered and well-ordered.
  • Various releases of macOS 10.10, 10.11, 10.12

Explanation

I have encountered related bug reports in the Travis CI, GNU Make, and macOS communities, but never found a complete explanation until recently. The 12/1/17 reply on travis-ci/travis-ci#4704 explains that this bug is caused by the EAGAIN error code ("try again / data not ready"), which a write returns when its target, such as a socket, is in nonblocking mode and cannot accept data immediately.
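
The EAGAIN behavior described above can be reproduced in isolation. The following is a minimal sketch, assuming Python 3 on a POSIX system (the report itself used Python 2.7, where the same condition surfaces as IOError). It uses a pipe as a stand-in for the real stdout, and the helper name fill_nonblocking_pipe is hypothetical. On macOS, EAGAIN has the value 35, which matches the "Errno 35" message reported above.

```python
import errno
import fcntl
import os


def fill_nonblocking_pipe():
    """Write to a nonblocking pipe until the kernel refuses; return the errno."""
    read_fd, write_fd = os.pipe()
    try:
        # Switch the write end to nonblocking mode.
        flags = fcntl.fcntl(write_fd, fcntl.F_GETFL)
        fcntl.fcntl(write_fd, fcntl.F_SETFL, flags | os.O_NONBLOCK)
        chunk = b"x" * 4096
        while True:
            try:
                # Nobody reads from the pipe, so its buffer eventually fills.
                os.write(write_fd, chunk)
            except BlockingIOError as err:
                # A blocking descriptor would wait here; a nonblocking one fails.
                return err.errno
    finally:
        os.close(read_fd)
        os.close(write_fd)


if __name__ == "__main__":
    code = fill_nonblocking_pipe()
    print(code == errno.EAGAIN, os.strerror(code))
```

A program such as make that does not expect this error on stdout treats it as a fatal write error rather than retrying.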

To query if O_NONBLOCK is set, you can use:

python -c 'import os,sys,fcntl; flags = fcntl.fcntl(sys.stdout, fcntl.F_GETFL); print(flags&os.O_NONBLOCK);'

Embedding this command in the regression test driver returns 0 (O_NONBLOCK disabled) after every test until an MPI-enabled test executes, then it returns a nonzero number. So, the open question is: why do the MPI-enabled regression test scripts turn on nonblocking IO, and why does it occur only after the overall completion of the test?
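
A more readable form of the query, together with a small demonstration of how the flag can outlive the process that set it, might look like the sketch below. It assumes Python 3 on a POSIX system. O_NONBLOCK is a file status flag stored on the open file description, so a child that inherits a descriptor and changes the flag also changes it for the parent. The child command here is a hypothetical stand-in for a launcher that leaves the flag set, not the actual MPICH code, and the function names are illustrative.

```python
import fcntl
import os
import subprocess
import sys


def is_nonblocking(fd):
    """Return True if O_NONBLOCK is set on the given file descriptor."""
    return bool(fcntl.fcntl(fd, fcntl.F_GETFL) & os.O_NONBLOCK)


# Hypothetical stand-in for a launcher that leaves O_NONBLOCK set on its stdout.
CHILD_SETS_NONBLOCK = (
    "import fcntl, os; "
    "flags = fcntl.fcntl(1, fcntl.F_GETFL); "
    "fcntl.fcntl(1, fcntl.F_SETFL, flags | os.O_NONBLOCK)"
)


def flag_leaks_from_child():
    """Run the stand-in child on a pipe; return the flag state before and after."""
    read_fd, write_fd = os.pipe()
    try:
        before = is_nonblocking(write_fd)
        # The child's stdout shares the parent's open file description.
        subprocess.check_call(
            [sys.executable, "-c", CHILD_SETS_NONBLOCK], stdout=write_fd
        )
        after = is_nonblocking(write_fd)
        return before, after
    finally:
        os.close(read_fd)
        os.close(write_fd)


if __name__ == "__main__":
    print(flag_leaks_from_child())
```

In a test driver, calling is_nonblocking(sys.stdout.fileno()) after each test gives the same information as the one-liner above.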

Possible fixes

  • Disable nonblocking IO after all regression tests, or after an MPI-enabled regression test. I am currently placing the following command in my driver script:
python -c 'import os,sys,fcntl; flags = fcntl.fcntl(sys.stdout, fcntl.F_GETFL); fcntl.fcntl(sys.stdout, fcntl.F_SETFL, flags&~os.O_NONBLOCK);'
  • Redirect the output of the make() command in athena.py to a file by replacing the subprocess.check_call() calls with subprocess.Popen() and pipes. Or, suppress the output by using subprocess.check_output() instead (without ever forwarding the stream to stdout).
  • Figure out how to prevent the MPI-enabled regression test from switching to nonblocking IO in the first place.
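
The first two options in the list above could be sketched as follows, assuming Python 3 on a POSIX system. The function names clear_nonblocking and run_logged and the log file name are hypothetical and are not part of athena.py. The second function keeps subprocess.check_call() and points it at a file, which is a simpler variant of the Popen-based redirect.

```python
import fcntl
import os
import subprocess
import sys


def clear_nonblocking(fd):
    """Clear O_NONBLOCK on fd; return True if the flag had been set."""
    flags = fcntl.fcntl(fd, fcntl.F_GETFL)
    if flags & os.O_NONBLOCK:
        fcntl.fcntl(fd, fcntl.F_SETFL, flags & ~os.O_NONBLOCK)
        return True
    return False


def run_logged(cmd, log_path):
    """Run cmd with stdout and stderr sent to log_path instead of the terminal."""
    with open(log_path, "wb") as log:
        # check_call raises CalledProcessError on a nonzero exit code.
        subprocess.check_call(cmd, stdout=log, stderr=subprocess.STDOUT)


if __name__ == "__main__":
    # Option 1: restore blocking mode on stdout, e.g. after each test.
    clear_nonblocking(sys.stdout.fileno())
    # Option 2: keep child output off the shared stdout entirely.
    run_logged([sys.executable, "-c", "print('build output')"], "make.log")
```

Clearing the flag treats the symptom after the fact, while redirecting to a file avoids writing to the affected descriptor at all, at the cost of hiding the build output from the console.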
@felker felker added the bug Broken functionality or unexpected result label Dec 5, 2017

felker commented Dec 7, 2017

Update: it turns out that this is a known MPICH bug on macOS: pmodels/mpich#1782

It was apparently fixed in the MPICH repository about a month ago: pmodels/mpich#2755

Now, I am unable to reproduce the bug with OpenMPI on macOS. I don't think any modification needs to be made to the Athena++ regression test suite at this time. It is just something for macOS users to be aware of when running the suite with an MPICH version older than 3.2.1, which was released on 11/10/2017. Presently, Homebrew users need to install MPICH with the --HEAD flag to get the patched 3.2.1. The --devel flag results in the year-old https://github.com/pmodels/mpich/releases/tag/v3.3a2

@felker felker closed this as completed Dec 7, 2017
@felker felker added the testing Regression test suite and CI label Apr 13, 2018
felker added a commit that referenced this issue Nov 2, 2018
- Not compatible with Python 3, see #169. Follow PEP 394 and reserve "python"
  for scripts that are compatible with both Py2 and Py3.

- Remove artificial C++ style error in field.hpp

- This is the first change from upstream cpplint.py
google/styleguide@1b206ee

Calling "python -u" in cpplint_athena.sh has fixed the previously
jumbled stdout and stderr from cpplint.py calls to sys.stdout.write()
and sys.stderr.write() in the Jenkins log. However, unbuffered writes
may have been an issue with regression test crashes in macOS and on
Travis CI (see #47), but it was most likely a red herring due to the
known O_NONBLOCK bug in MPICH's mpirun(). Monitor this.

See if Jenkins log now contains pure output from git ls-tree command.