Segfault: Calling finufft and fftw in threaded loop #62

Closed
jkrimmer opened this issue Aug 22, 2024 · 12 comments

@jkrimmer
Contributor

jkrimmer commented Aug 22, 2024

Unfortunately, I have stumbled upon another segmentation fault caused by multithreading in julia: calling both finufft and fftw in a threaded loop yields a segmentation fault (on Windows and Linux). The order in which finufft and fftw are called does not matter.

MWE

Make sure to initialize julia with multiple threads, e.g., run julia --threads=auto.

using FFTW
using FINUFFT

function _inner(N)
    y = randn(ComplexF64, N)
    x = randn(N)
    nufft1d1(x, y, 1, 1e-8, N, nthreads=1)
    fft(x)
end

function issue(iters=1000, N=1024; threaded=false)
    if threaded
        Threads.@threads for k in 1:iters
            _inner(N)
        end
    else
        for k in 1:iters
            _inner(N)
        end
    end
end

@info "single-threaded loop" 
issue()
@info "multi-threaded loop"
issue(threaded=true)

Note: Commenting out the fft call or the nufft1d1 call makes the issue disappear.

General information:
julia v1.10.4
FINUFFT v3.2.0
finufft_jll v2.2.0+2

Potential but suboptimal solutions

The only mitigation approaches I have found so far are going back to finufft_jll v2.1.0 (where flatironinstitute/finufft#354 had not yet been merged) or building a more recent version in which fftw_make_planner_thread_safe is re-introduced, which is probably not desired. Alternatively, building finufft with DUCC instead of FFTW works as well.
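
For anyone needing the downgrade in the meantime, the older binary can be pinned from julia's package manager. A minimal sketch, assuming the registry still serves that build:

using Pkg
# request the last unaffected binary build and pin it so Pkg.update() keeps it
Pkg.add(PackageSpec(name="finufft_jll", version=v"2.1.0"))
Pkg.pin("finufft_jll")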

@ahbarnett
Collaborator

ahbarnett commented Aug 22, 2024 via email

@jkrimmer
Contributor Author

Hi Alex,

Thank you for looking into this issue. Fortunately, I have already been able to test the DUCC option which resolves this issue. Maybe there is some clash between the FFTW-planner calls originating from julia and the ones from within finufft?

Thanks
Jonas

@ludvigak
Owner

Well, I guess the good news is that thanks to @jkrimmer we now have working crash examples for Windows, Linux, and MacOS!

Not sure if I have any insight, though; it seems to me that the fundamental issue is the FFTW/FINUFFT interplay with threading active, but then we should be able to reproduce this directly in C++ code.

@jkrimmer
Contributor Author

Finally, I have been able to reproduce the issue in C++. It seems that the finufft-internal FFTW lock does not prevent other callers from building an FFTW plan concurrently. Please find an MWE below (on Ubuntu, it builds with g++ test_finufft_crash.cpp -o test_finufft_crash -I[include] -L[lib] -lfinufft -lfftw3_threads -lfftw3 -fopenmp -Wl,-rpath=[lib] -std=c++11). Re-enabling the commented-out lock around the finufft planner stage prevents the segfault.

MWE
#include <iostream>
#include <complex>
#include <vector>
#include <random>
#include <mutex>

#include <fftw3.h>
#include <finufft.h>
#include <omp.h>

using namespace std;

#define N 65384

int main() {
    int64_t Ns[3]; // guru describes mode array by vector [N1,N2..]
    Ns[0] = N;

    // set lock
    recursive_mutex lck;

    finufft_opts *opts = new finufft_opts;         // opts is pointer to struct
    finufft_default_opts(opts);
    opts->nthreads = 1;
    opts->debug = 0;

    // random nonuniform points (x) and complex strengths (c)...
    vector<complex<double>> c(N);

    // init FFTW threads
    fftw_init_threads();

    // FFTW and FINUFFT execution using OpenMP parallelization
    #pragma omp parallel for
    for (int j = 0; j < 100; ++j) {

        // allocate output array for FFTW...
        vector<complex<double>> F1(N);
        // FFTW plan
        lck.lock();
        fftw_plan_with_nthreads(1);
        fftw_plan plan = fftw_plan_dft_1d(N, reinterpret_cast<fftw_complex*>(c.data()), reinterpret_cast<fftw_complex*>(F1.data()), FFTW_FORWARD, FFTW_ESTIMATE);
        fftw_destroy_plan(plan);
        lck.unlock();
        
        // FINUFFT plan; the lock around it is deliberately left commented out,
        // which reproduces the segfault (re-enabling it avoids the crash)
        // lck.lock();
        finufft_plan nufftplan; 
        finufft_makeplan(1, 1, Ns, 1, 1, 1e-6, &nufftplan, opts);
        finufft_destroy(nufftplan);
        // lck.unlock();
    }
    return 0;
}

Similarly, we observe the same behavior in julia: running the following code with multiple threads yields a segmentation fault on my machine.

using FFTW
using FINUFFT

tmp = randn(ComplexF64, 1024)

Threads.@threads for k = 1:1000
    plan_fft(tmp)
    finufft_makeplan(1, [1024], 1, 1, 1e-6)
end

However, adding an additional lock around the finufft planning avoids the segfault:

using FFTW
using FINUFFT

tmp = randn(ComplexF64, 1024)
lk = ReentrantLock()

Threads.@threads for k = 1:1000
    plan_fft(tmp)

    lock(lk)
    try
        finufft_makeplan(1, [1024], 1, 1, 1e-6)
    finally
        unlock(lk)
    end
end

@ahbarnett
Collaborator

Robert @blackwer, since you are our expert on FFTW global state and thread-safety, would you be able to take a look at the C++ MWE above, which apparently causes a crash? (I have only verified the julia-version crash.) Thanks, Alex

@blackwer

blackwer commented Sep 3, 2024

I don't think there is a finufft-side solution to this. There's no way for the finufft side to detect what the application side is doing. This is actually covered in the discussion at https://www.fftw.org/doc/Thread-safety.html, a document I have had to look at way too many times.

One proper solution is for the application, rather than finufft, to call fftw_make_planner_thread_safe(). finufft can do this too, and in practice it's usually sufficient, but there are cases where it could still crash even with the older version that did have this call in it. E.g. if you call finufft_makeplan before you make fftw_plan_dft_1d in your MWE, this solution works; it doesn't when you call fftw_plan_dft_1d directly first. It might work, but it's technically a race condition.

Another solution would be to supply a mechanism to acquire a pointer to the finufft lock on the application side, but I don't think this is adequate for the julia case, since that'd have to end up in the fftw code, which doesn't make much sense.

Is it possible on the julia side to have the FINUFFT import make the fftw_make_planner_thread_safe() call? Otherwise I think the only solution is a custom build either with an alternative FFT implementation or with a patch to the FFTW_INIT routine (though this does have the race condition).
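
For illustration, a minimal sketch of what that julia-side call could look like, assuming the symbol is exported by the libfftw3 binary that FFTW_jll ships (this is not FINUFFT.jl's actual init code):

using FFTW_jll  # provides the libfftw3 library handle

# make FFTW's planner serialize itself against concurrent planner calls;
# must run before any plan is created, e.g. from the package's __init__
ccall((:fftw_make_planner_thread_safe, libfftw3), Cvoid, ())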

@jkrimmer
Contributor Author

jkrimmer commented Sep 4, 2024

Thank you very much for the detailed explanation! I assume that making the call to fftw_make_planner_thread_safe during the FINUFFT initialization in julia should work and provide a quick fix.

Lately, I have been considering another approach: the FFTW interface FFTW.jl uses julia's ReentrantLock() to protect the FFTW planning stage from race conditions. If we could pass an application-defined lock/locking function to the finufft library, e.g. based on this ReentrantLock, the race conditions would, in my understanding, also be avoided. Do you think this approach could work as well?

@blackwer

blackwer commented Sep 4, 2024

Lately, I have been considering another approach: the FFTW interface FFTW.jl uses julia's ReentrantLock() to protect the FFTW planning stage from race conditions. If we could pass an application-defined lock/locking function to the finufft library, e.g. based on this ReentrantLock, the race conditions would, in my understanding, also be avoided. Do you think this approach could work as well?

This definitely works, and can be done straightforwardly without having to extend the API (by adding two void (*func)(void *) function pointers to the options struct and an optional void * for the lock data). This was very straightforward to implement on the finufft side, so it could feasibly make it into 2.3 (maybe?). I've written an example implementation at https://github.com/blackwer/finufft/tree/user_fftw_lock with new driver code:

MWE
#include <complex>
#include <vector>
#include <mutex>

#include <fftw3.h>
#include <finufft.h>
#include <omp.h>

using namespace std;

#define N 65384

void locker(void *lck) { reinterpret_cast<recursive_mutex *>(lck)->lock(); }
void unlocker(void *lck) { reinterpret_cast<recursive_mutex *>(lck)->unlock(); }

int main() {
    int64_t Ns[3]; // guru describes mode array by vector [N1,N2..]
    Ns[0] = N;
    recursive_mutex lck;

    finufft_opts opts;
    finufft_default_opts(&opts);
    opts.nthreads = 1;
    opts.debug = 0;
    opts.fftw_lock_fun = locker;
    opts.fftw_unlock_fun = unlocker;
    opts.fftw_lock_data = reinterpret_cast<void *>(&lck);

    // random nonuniform points (x) and complex strengths (c)...
    vector<complex<double>> c(N);

    // init FFTW threads
    fftw_init_threads();

    // FFTW and FINUFFT execution using OpenMP parallelization
    #pragma omp parallel for
    for (int j = 0; j < 100; ++j) {

        // allocate output array for FFTW...
        vector<complex<double>> F1(N);

        // FFTW plan
        lck.lock();
        fftw_plan_with_nthreads(1);
        fftw_plan plan = fftw_plan_dft_1d(N, reinterpret_cast<fftw_complex*>(c.data()), reinterpret_cast<fftw_complex*>(F1.data()), FFTW_FORWARD, FFTW_ESTIMATE);
        fftw_destroy_plan(plan);
        lck.unlock();

        // FINUFFT plan
        finufft_plan nufftplan;
        finufft_makeplan(1, 1, Ns, 1, 1, 1e-6, &nufftplan, &opts);
        finufft_destroy(nufftplan);
    }
    return 0;
}

@ahbarnett thoughts?

@DiamonDinoia

This is interesting: having a global lock passed in C++ is not an issue. But what about passing a Julia lock to finufft?

Alternatively, one has to acquire the lock in Julia and then run the entire finufft_makeplan in a critical section (as in the workaround above), which feels wasteful, since the only non-thread-safe code is the FFTW planning.

As for 2.3.0, I would not add this to it. I would do a 2.3.1, or put it in 2.4.0 if we release that soon. 2.3.0 has been in the works for too long; we should release it asap :)

@blackwer

blackwer commented Sep 4, 2024

This is interesting: having a global lock passed in C++ is not an issue. But what about passing a Julia lock to finufft?

You should be able to pass one from any language that lets you create a standard C-style callback function, which julia definitely lets you do, even with closures. Python does too, if you could even make this a problem there given its GIL-limited threading. https://julialang.org/blog/2013/05/callback/ has details on how to do this in julia.
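
As a concrete sketch of that julia route (the finufft_opts field names are the ones from the branch above; how the wrapper exposes them is assumed here):

using FFTW  # the real fix should reuse FFTW.jl's planner lock; see comment below

# In the actual wrapper this should be the very lock FFTW.jl's planner takes,
# which is internal to FFTW.jl; a fresh lock stands in for it in this sketch.
const planner_lock = ReentrantLock()

# C-compatible callbacks; the void* payload is unused since we close over the lock
lock_cb(::Ptr{Cvoid}) = (lock(planner_lock); nothing)
unlock_cb(::Ptr{Cvoid}) = (unlock(planner_lock); nothing)

# top-level @cfunction pointers stay valid for the whole session
const lock_ptr = @cfunction(lock_cb, Cvoid, (Ptr{Cvoid},))
const unlock_ptr = @cfunction(unlock_cb, Cvoid, (Ptr{Cvoid},))
# lock_ptr/unlock_ptr would then be stored in finufft_opts (fftw_lock_fun,
# fftw_unlock_fun), with fftw_lock_data left as C_NULL.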

As for 2.3.0, I would not add this to it. I would do a 2.3.1, or put it in 2.4.0 if we release that soon. 2.3.0 has been in the works for too long; we should release it asap :)

I'm in no rush, and if Jonas is happy with the stop-gap fftw_make_planner_thread_safe option, then I am too. But the provided solution does already work (I think I just fixed the last issue) and is about as language-agnostic as possible, with a very small code footprint.

@ahbarnett
Collaborator

ahbarnett commented Sep 4, 2024 via email

@jkrimmer
Contributor Author

jkrimmer commented Sep 9, 2024

Dear all,

Thanks a lot for your work on this issue! I have successfully tested the user-locking approach by @blackwer in julia over at https://github.com/jkrimmer/FINUFFT.jl/tree/user_fftw_lock. However, until the required library adaptations are merged, we can use fftw_make_planner_thread_safe (see #66) to resolve this issue.

Cheers

@jkrimmer jkrimmer closed this as completed Sep 9, 2024