Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

MKL FFT failures #110

Closed
tskisner opened this issue Aug 19, 2017 · 10 comments
Closed

MKL FFT failures #110

tskisner opened this issue Aug 19, 2017 · 10 comments
Assignees
Labels

Comments

@tskisner
Copy link
Member

When building with gcc and linking to modern MKL versions, the MKL unit tests fail. This happens for both single and batched FFTs. This problem is not seen when using FFTW. I will disable MKL FFTs in the mean time.

@tskisner tskisner added the bug label Aug 19, 2017
@tskisner tskisner self-assigned this Aug 19, 2017
@tskisner
Copy link
Member Author

I thought this problem looked suspiciously familiar... I actually mostly fixed this in an earlier commit:

30c24f9

But this was done directly to master accidentally as part of other changes and then reverted. The fix / cleanup was then never migrated to a branch for proper testing and merging. I'll cherry pick this into a branch now for testing.

@tskisner
Copy link
Member Author

tskisner commented Sep 7, 2017

With the latest master branch (after merging the reworking of MKL FFTs in ee43a97), all C++ and python unit tests pass on a workstation using Intel compilers and MKL version 2018-beta. On edison.nersc.gov, using gcc and MKL, the following behavior is seen:

  1. C++ "single" (one FFT) roundtrip passes.
  2. C++ "multi" (batch of FFTs) roundtrip fails.
  3. Python unit test of wrapper around C++ FFT roundtrip fails. This uses a single FFT, but uses a persistent cache of FFT plans.
  4. Noise simulation unit test "fails", but this may or may not be related, since the threshold is kind of arbitrary.

Leaving this issue open to track why this combination (gcc + MKL on edison) fails while a pure Intel build does not.

@tskisner
Copy link
Member Author

tskisner commented Sep 8, 2017

Testing this on cori.nersc.gov (KNL) with Intel compilers and MKL also passes all tests. So this seems to be an interaction between MKL and gcc-6.3 on edison. I will test that same combination on a workstation to determine if it is NERSC-specific.

@tskisner
Copy link
Member Author

tskisner commented Sep 8, 2017

Ok, confirmed that I see the same failures on a linux workstation with latest MKL and gcc-6.3. That should make it much easier to debug.

@tskisner
Copy link
Member Author

tskisner commented Sep 8, 2017

On deeper inspection, this is actually a problem with the persistent cache of fft plans. In the older versions of toast, we had independent OpenMP threads that were processing pieces of the data and so we needed different FFT plans for each thread. In the current version of toast, we made a design decision to keep threading at a lower level, so that mid-level operations could be done in python. So rather than individual threads doing separate FFTs, we rely on the FFT libraries being smart about threading (which for 1D FFTs is not great, but...) The per-process cache of FFT plans should then have only one copy of the plan rather than one per thread. I now have unit tests that use the plan cache and fail as they should. Working on this fix.

@tskisner
Copy link
Member Author

tskisner commented Sep 9, 2017

OK, I found the real source of the problem. We always link to the single-library version of MKL (libmkl_rt). However, as documented here:

https://software.intel.com/en-us/forums/intel-math-kernel-library/topic/657060

This defaults to using the Intel OpenMP library rather than the GNU one. This means that when compiling with gcc and linking to MKL, operations with OMP_NUM_THREADS > 1 will give the wrong answer.

The solution is to explicitly link to the full set of individual MKL libraries, and to use the correct set depending on whether one is compiling with the Intel compilers or gcc.

@tskisner
Copy link
Member Author

tskisner commented Sep 10, 2017

I explored this in more detail, and implemented a test during configure that links to MKL differently depending on the compiler used. This works when linking C++ code (like the toast_test executable used to run the compiled unit tests).

However, when we dlopen (through ctypes) the libtoast.so library, it tries to dlopen its chain of dependency libraries and claims that libmkl_gnu_thread has missing symbols (it does, and they are defined in libmkl_core). This seems to leave 3 choices:

  1. Figure out why loading libtoast does not load dependencies (I tried extra linking options like -Wl,--no-undefined when building libtoast) or does not load them in the right order.

  2. Always link to the static MKL libraries, which will then be built in to libtoast.so. This "probably" won't cause problems, since libtoast is the top-level object (we are not linking libtoast to other code which might already have MKL symbols in it)

  3. Go back to linking to libmkl_rt, and then include a snippet of code (generated at configure time) which is called at library initialization and calls the right MKL functions that set the threading layer to whatever compiler we used.

Static linking in option 2 might introduce problems in cross compilation. Also, trying to get the linking of libtoast "just right" so that option 1 works could be very machine / compiler / mkl version specific. Instead, I think option 3 is the way to go. There would be toast function "toast::mkl_init()" which would:

  1. Be a no-op if we don't have MKL or if the function has already been called.

  2. Call the mkl functions to set the interface and threading layers based on whether we are using the Intel or GNU compilers. We don't currently use MKL on OS X with Clang, so we don't need to consider that. On OS X we should probably use the accelerate framework instead anyway.

  3. be called internally in toast before making MKL calls (this is currently in the fft and special function source files).

@tskisner
Copy link
Member Author

The discussion so far in this issue holds true and makes sense for C++ code linking to MKL. If you use gnu compilers and -fopenmp, but then want to link to threaded MKL, you must ensure that the gnu threading interface is used (by calling mkl_set_threading_layer() or setting the proper environment variable).

Things become more difficult when throwing in anaconda python. Anaconda (and Intel Distribution for Python) ship with their own version of libmkl_rt and the Intel thread interface (no libmkl_gnu_thread.so). Since Python / Numpy does not use OpenMP, this is fine- the Intel thread library is always used.

In toast, we can have the situation where we have a threaded C++ library built with gnu compilers, linking to threaded MKL. We can make these play together using the techniques above. However, we are dlopen'ing that library from python (using ctypes). When this happens, somehow libtoast.so calls to MKL end up using symbols / functions in the libmkl_rt from anaconda.

For compatibility, I believe we will either have to statically link to MKL when building toast, OR simply not allow use of MKL with non-Intel compilers. In the latter case, it is troubling that our runtime calls would be using the MKL shipped with python packages rather than the (potentially newer) MKL directly from Intel at compile time. For example, this means that the C++ unit tests run from the toast_test executable and the same tests run from within the python toast.test() function would have different code paths (through different MKL libraries). This point is one that pushes us towards static linking of MKL at compile time.

@tskisner
Copy link
Member Author

I have looked into the static linking option, and even after jumping through hoops with libtool this does not really work. I think the only sane path here is to simply only use MKL when we are also using the Intel compiler. Again the solution for pure C++ code is straightforward (use libmkl_rt with threading layer set to gnu or intel, OR link explicitly to the list of MKL libraries).

When mixing threaded C++ code that uses MKL and loading this into python which has a different MKL library already loaded, the C++ code must be compatible with the MKL used by python. Which means it must use intel threads. Which means it must be built with the intel compiler.

@tskisner
Copy link
Member Author

Closed by #114

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

1 participant