Problems with multiprocessing and numpy.fft #8140

Closed
tillahoffmann opened this issue Oct 11, 2016 · 10 comments

@tillahoffmann
Contributor

I would like to compute a set of FFTs in parallel using numpy.fft.fft and multiprocessing. Unfortunately, running the FFTs in parallel results in a large kernel load.

Here is a minimal example that reproduces the problem:

import numpy as np
import scipy
import scipy.fftpack
import multiprocessing
from argparse import ArgumentParser


SIZE = 10000000


def f_numpy(i):
    x = np.empty(SIZE)
    np.fft.fft(x)
    return i


def f_scipy(i):
    x = np.empty(SIZE)
    scipy.fft(x)
    return i


def f_scipy_rfft(i):
    x = np.empty(SIZE)
    scipy.fftpack.rfft(x)
    return i


functions = {
    'numpy': f_numpy,
    'scipy': f_scipy,
    'scipy_rfft': f_scipy_rfft,
}


def __main__():
    ap = ArgumentParser('fft_test')
    ap.add_argument('--function', '-f', help='method used to calculate the fft', choices=functions, default='numpy')
    ap.add_argument('--single_core', '-s', action='store_true', help='use only a single core')
    ap.add_argument('--method', '-m', help='start method', choices=['fork', 'spawn'], default='fork')
    args = ap.parse_args()

    multiprocessing.set_start_method(args.method)

    # Show the configuration
    print("number of cores: %d" % multiprocessing.cpu_count())
    np.__config__.show()

    # Get the method
    f = functions[args.function]

    # Execute using a single core
    if args.single_core:
        for i in range(multiprocessing.cpu_count()):
            f(i)
            print(i, end=' ')
    # Execute using all cores
    else:
        pool = multiprocessing.Pool()
        for i in pool.map(f, range(multiprocessing.cpu_count())):
            print(i, end=' ')


if __name__ == '__main__':
    __main__()

Running time python fft_test.py gives me the following results:

number of cores: 48
openblas_info:
    library_dirs = ['/home/till/anaconda2/envs/sonalytic/lib']
    define_macros = [('HAVE_CBLAS', None)]
    libraries = ['openblas', 'openblas']
    language = c
openblas_lapack_info:
    library_dirs = ['/home/till/anaconda2/envs/sonalytic/lib']
    define_macros = [('HAVE_CBLAS', None)]
    libraries = ['openblas', 'openblas']
    language = c
blas_opt_info:
    library_dirs = ['/home/till/anaconda2/envs/sonalytic/lib']
    define_macros = [('HAVE_CBLAS', None)]
    libraries = ['openblas', 'openblas']
    language = c
blas_mkl_info:
  NOT AVAILABLE
lapack_opt_info:
    library_dirs = ['/home/till/anaconda2/envs/sonalytic/lib']
    define_macros = [('HAVE_CBLAS', None)]
    libraries = ['openblas', 'openblas']
    language = c

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 

real    0m7.422s
user    0m9.830s
sys 1m26.603s

Running with a single core, i.e. python fft_test.py -s gives

real    1m0.345s
user    0m56.558s
sys 0m2.959s

I thought that using spawn rather than fork might resolve the problem, but I had no luck. Any idea what might cause the large kernel wait?

I originally posted this issue on Stack Overflow but realised this may be a more appropriate place.

@rgommers
Member

Is that with 1.11.x or current master? Could be related to gh-7712.

@tillahoffmann
Contributor Author

Will have a look at master. Currently running 1.11.2.

@tillahoffmann
Contributor Author

I still get the same problem on master.

@juliantaylor
Contributor

juliantaylor commented Oct 12, 2016

What kernel version are you using?
The memory allocation is quite large, so my guess is that the kernel time is spent in the lock required for memory mapping.
You could check that by profiling what the kernel is doing with perf.

Note that cpu_count is not the right method to get the number of usable CPUs; that is os.sched_getaffinity(0), which respects CPU affinity.
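
For reference, a minimal comparison of the two counts could look like this (os.sched_getaffinity is only available on Linux and some other Unixes):

import os
import multiprocessing

# cpu_count() reports every CPU in the machine, while sched_getaffinity(0)
# reports only the CPUs this process is allowed to run on (e.g. under taskset).
print('cpu_count:', multiprocessing.cpu_count())
print('usable cpus:', len(os.sched_getaffinity(0)))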

@juliantaylor
Contributor

juliantaylor commented Oct 12, 2016

I just checked it on a 16-core machine, using np.ones instead of an FFT, on a 4.4 kernel: the lock used for memory mapping gets heavily contended as the number of processes grows, and contention explodes when oversubscribing.
Do you have 48 physical cores, or are they 24 hyperthreaded cores? Try running with only 24 jobs; the kernel time should go down a bit.
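
If it is 24 physical cores with two-way hyperthreading, a quick way to test that is to cap the pool size explicitly (the 24 below is just that assumed physical-core count):

import multiprocessing

# Cap the pool at the assumed number of physical cores instead of relying on
# cpu_count(), which also counts hyperthreads.
pool = multiprocessing.Pool(processes=24)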

@tillahoffmann
Contributor Author

tillahoffmann commented Oct 12, 2016

Thanks for the hints, @juliantaylor. I believe the server has 48 physical cores but will check with the person who set it up. Running os.sched_getaffinity(0) gives me the set {0, ..., 47}.

Assuming you were referring to the Linux kernel version, I get the following from running uname -r: 3.10.0-327.28.3.el7.x86_64.

I will move the memory allocation into the initializer of the multiprocessing call, reduce the allocated memory and use a larger chunksize to see whether that resolves the problem.
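
A rough sketch of what I have in mind (the initializer-based preallocation below is only illustrative, not the final change):

import multiprocessing
import numpy as np

SIZE = 10000000
buffer = None  # filled in once per worker process by the initializer

def init_worker():
    # Allocate (and touch) the input array once per worker so that later FFT
    # calls do not have to request fresh pages from the kernel every time.
    global buffer
    buffer = np.zeros(SIZE)

def f_numpy(i):
    # np.fft.fft still allocates its complex output internally, so this only
    # removes the repeated allocation of the input array.
    np.fft.fft(buffer)
    return i

if __name__ == '__main__':
    pool = multiprocessing.Pool(initializer=init_worker)
    print(pool.map(f_numpy, range(multiprocessing.cpu_count())))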

Update: the CPU configuration in /proc/cpuinfo reveals

  • 4 cpus (physical id from 0 to 3)
  • 12 cores per cpu (cpu cores is 12 for each cpu)

@tillahoffmann
Contributor Author

It turns out the memory allocation was indeed the problem. Changing the script to allocate the memory ahead of time eliminates the kernel wait (using 1.11.2 and master). Closing this issue because my underlying problem probably has nothing to do with numpy but rather with the way I allocate memory.

Thanks for the help, @juliantaylor and @rgommers.

@juliantaylor
Contributor

juliantaylor commented Oct 12, 2016

The len() of that set is the number of available CPUs. Consider taskset -c 1 python ...: multiprocessing.cpu_count will not take that restriction into account.

@tillahoffmann
Contributor Author

The memory allocation contention is still an issue for us when computing spectrograms in different processes on the same machine. I've done a bit more digging and created a minimal reproducible example here.

I'm surprised that calculating spectrograms is so intensive in terms of memory allocation; I would have assumed it would be limited by the compute power of the machine. Do you think it is possible to make improvements in terms of memory allocation, or are we facing a fundamental limit here?

In particular, running a single process with strace -c attached for ten seconds, I get the following information.

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 95.94    0.061604          16      3910           brk
  2.92    0.001876           1      1302           rt_sigaction
  1.11    0.000715           1       651           rt_sigprocmask
  0.02    0.000012           2         7           write
  0.01    0.000006           1         7           getpid
------ ----------- ----------- --------- --------- ----------------
100.00    0.064213                  5877           total

Doing the same while running 64 processes gives the following.

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 99.96    2.340887        4425       529           brk
  0.03    0.000809           5       176           rt_sigaction
  0.01    0.000169           2        88           rt_sigprocmask
  0.00    0.000027          27         1           write
  0.00    0.000002           2         1           getpid
------ ----------- ----------- --------- --------- ----------------
100.00    2.341894                   795           total

The above experiments were run in a docker container on a linux host. Here are the details:

  • uname -a (inside container): Linux 2deec7f6ca9d 4.9.87-linuxkit-aufs #1 SMP Wed Mar 14 15:12:16 UTC 2018 x86_64 GNU/Linux
  • uname -a (on host): [host name] 3.16.0-77-generic #99~14.04.1-Ubuntu SMP Tue Jun 28 19:17:10 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
  • docker --version: Docker version 17.09.0-ce, build afdb6d4
  • numpy.version.version: 1.14.5
  • scipy.version.version: 1.1.0

Whilst I ran the experiments in a container to make them easier to reproduce, running the code outside the container exhibits the same problem.

@charris
Member

charris commented Jun 29, 2018

Hmm, that's strange. We do cache some things, but I cannot see how that would cause trouble, and I'm not sure how we allocate memory otherwise. Does the scipy FFT have the same problem?
