Problems with multiprocessing and numpy.fft #8140

Closed
tillahoffmann opened this issue Oct 11, 2016 · 10 comments

@tillahoffmann
Contributor

I would like to compute a set of FFTs in parallel using numpy.fft.fft and multiprocessing. Unfortunately, running the FFTs in parallel results in a large kernel load.

Here is a minimal example that reproduces the problem:

import numpy as np
import scipy
import scipy.fftpack
import multiprocessing
from argparse import ArgumentParser


SIZE = 10000000


def f_numpy(i):
    x = np.empty(SIZE)
    np.fft.fft(x)
    return i


def f_scipy(i):
    x = np.empty(SIZE)
    scipy.fft(x)
    return i


def f_scipy_rfft(i):
    x = np.empty(SIZE)
    scipy.fftpack.rfft(x)
    return i


functions = {
    'numpy': f_numpy,
    'scipy': f_scipy,
    'scipy_rfft': f_scipy_rfft,
}


def __main__():
    ap = ArgumentParser('fft_test')
    ap.add_argument('--function', '-f', help='method used to calculate the fft', choices=functions, default='numpy')
    ap.add_argument('--single_core', '-s', action='store_true', help='use only a single core')
    ap.add_argument('--method', '-m', help='start method', choices=['fork', 'spawn'], default='fork')
    args = ap.parse_args()

    multiprocessing.set_start_method(args.method)

    # Show the configuration
    print("number of cores: %d" % multiprocessing.cpu_count())
    np.__config__.show()

    # Get the method
    f = functions[args.function]

    # Execute using a single core
    if args.single_core:
        for i in range(multiprocessing.cpu_count()):
            f(i)
            print(i, end=' ')
    # Execute using all cores
    else:
        pool = multiprocessing.Pool()
        for i in pool.map(f, range(multiprocessing.cpu_count())):
            print(i, end=' ')


if __name__ == '__main__':
    __main__()

Running time python fft_test.py gives me the following results:

number of cores: 48
openblas_info:
    library_dirs = ['/home/till/anaconda2/envs/sonalytic/lib']
    define_macros = [('HAVE_CBLAS', None)]
    libraries = ['openblas', 'openblas']
    language = c
openblas_lapack_info:
    library_dirs = ['/home/till/anaconda2/envs/sonalytic/lib']
    define_macros = [('HAVE_CBLAS', None)]
    libraries = ['openblas', 'openblas']
    language = c
blas_opt_info:
    library_dirs = ['/home/till/anaconda2/envs/sonalytic/lib']
    define_macros = [('HAVE_CBLAS', None)]
    libraries = ['openblas', 'openblas']
    language = c
blas_mkl_info:
  NOT AVAILABLE
lapack_opt_info:
    library_dirs = ['/home/till/anaconda2/envs/sonalytic/lib']
    define_macros = [('HAVE_CBLAS', None)]
    libraries = ['openblas', 'openblas']
    language = c

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 

real    0m7.422s
user    0m9.830s
sys 1m26.603s

Running with a single core, i.e. python fft_test.py -s gives

real    1m0.345s
user    0m56.558s
sys 0m2.959s

I thought that using spawn rather than fork might resolve the problem, but I had no luck. Any idea what might cause the large kernel wait?

I originally posted this issue on Stack Overflow but realised this may be a more appropriate place.

@rgommers
Member

Is that with 1.11.x or current master? Could be related to gh-7712.

@tillahoffmann
Contributor Author

Will have a look at master. Currently running 1.11.2.

@tillahoffmann
Contributor Author

I still get the same problem on master.

@juliantaylor
Contributor

juliantaylor commented Oct 12, 2016

What kernel version are you using?
The memory allocation is quite large, so my guess is that the kernel time is spent in the lock required for memory mapping.
You could check that by profiling what the kernel is doing with perf.

Note that cpu_count is not the right method to get the number of usable CPUs; that is os.sched_getaffinity(0), which respects CPU affinity.
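
For reference, a minimal comparison of the two counts could look like this (os.sched_getaffinity is only available on Linux and some other Unixes):

import os
import multiprocessing

# cpu_count() reports every CPU in the machine, while sched_getaffinity(0)
# reports only the CPUs this process is allowed to run on (e.g. under taskset).
print('cpu_count:', multiprocessing.cpu_count())
print('usable cpus:', len(os.sched_getaffinity(0)))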

@juliantaylor
Contributor

juliantaylor commented Oct 12, 2016

I just checked it on a 16-core machine, using np.ones instead of an FFT, on a 4.4 kernel: the lock used for memory mapping gets heavily contended as the number of processes grows, and contention explodes when oversubscribing.
Do you have 48 physical cores, or are they 24 hyperthreaded cores? Try running with only 24 jobs; the kernel time should go down a bit.
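
If it is 24 physical cores with two-way hyperthreading, a quick way to test that is to cap the pool size explicitly (the 24 below is just that assumed physical-core count):

import multiprocessing

# Cap the pool at the assumed number of physical cores instead of relying on
# cpu_count(), which also counts hyperthreads.
pool = multiprocessing.Pool(processes=24)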

@tillahoffmann
Contributor Author

tillahoffmann commented Oct 12, 2016

Thanks for the hints, @juliantaylor. I believe the server has 48 physical cores but will check with the person who set it up. Running os.sched_getaffinity(0) gives me the set {0, ..., 47}.

Assuming you were referring to the Linux kernel version, I get the following from running uname -r: 3.10.0-327.28.3.el7.x86_64.

I will move the memory allocation into the initializer of the multiprocessing call, reduce the allocated memory and use a larger chunksize to see whether that resolves the problem.
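
A rough sketch of what I have in mind (the initializer-based preallocation below is only illustrative, not the final change):

import multiprocessing
import numpy as np

SIZE = 10000000
buffer = None  # filled in once per worker process by the initializer

def init_worker():
    # Allocate (and touch) the input array once per worker so that later FFT
    # calls do not have to request fresh pages from the kernel every time.
    global buffer
    buffer = np.zeros(SIZE)

def f_numpy(i):
    # np.fft.fft still allocates its complex output internally, so this only
    # removes the repeated allocation of the input array.
    np.fft.fft(buffer)
    return i

if __name__ == '__main__':
    pool = multiprocessing.Pool(initializer=init_worker)
    print(pool.map(f_numpy, range(multiprocessing.cpu_count())))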

Update: the CPU configuration in /proc/cpuinfo reveals

  • 4 cpus (physical id from 0 to 3)
  • 12 cores per cpu (cpu cores is 12 for each cpu)

@tillahoffmann
Contributor Author

It turns out the memory allocation was indeed the problem. Changing the script to allocate the memory ahead of time eliminates the kernel wait (using 1.11.2 and master). Closing this issue because my underlying problem probably has nothing to do with numpy but rather with the way I allocate memory.

Thanks for the help, @juliantaylor and @rgommers.

@juliantaylor
Contributor

juliantaylor commented Oct 12, 2016

The len() of that set is the number of available CPUs. Consider taskset -c 1 python ...: multiprocessing.cpu_count will not take that restriction into account.

@tillahoffmann
Contributor Author

The memory allocation contention is still an issue for us when computing spectrograms in different processes on the same machine. I've done a bit more digging and created a minimal reproducible example here.

I'm surprised that calculating spectrograms is so intensive in terms of memory allocation; I would have assumed it would be limited by the compute power of the machine. Do you think it is possible to make improvements in terms of memory allocation, or are we facing a fundamental limit here?

In particular, running a single process with strace -c attached for ten seconds, I get the following information.

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 95.94    0.061604          16      3910           brk
  2.92    0.001876           1      1302           rt_sigaction
  1.11    0.000715           1       651           rt_sigprocmask
  0.02    0.000012           2         7           write
  0.01    0.000006           1         7           getpid
------ ----------- ----------- --------- --------- ----------------
100.00    0.064213                  5877           total

Doing the same while running 64 processes gives the following.

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 99.96    2.340887        4425       529           brk
  0.03    0.000809           5       176           rt_sigaction
  0.01    0.000169           2        88           rt_sigprocmask
  0.00    0.000027          27         1           write
  0.00    0.000002           2         1           getpid
------ ----------- ----------- --------- --------- ----------------
100.00    2.341894                   795           total

The above experiments were run in a docker container on a linux host. Here are the details:

  • uname -a (inside container): Linux 2deec7f6ca9d 4.9.87-linuxkit-aufs #1 SMP Wed Mar 14 15:12:16 UTC 2018 x86_64 GNU/Linux
  • uname -a (on host): [host name] 3.16.0-77-generic #99~14.04.1-Ubuntu SMP Tue Jun 28 19:17:10 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
  • docker --version: Docker version 17.09.0-ce, build afdb6d4
  • numpy.version.version: 1.14.5
  • scipy.version.version: 1.1.0

Whilst I ran the experiments in a container to make them easier to reproduce, running the code outside the container exhibits the same problem.

@charris
Member

charris commented Jun 29, 2018

Hmm, that's strange. We do cache some things, but I cannot see how that would cause trouble, and I'm not sure how we allocate memory otherwise. Does the scipy FFT have the same problem?
