Multiprocessing slows down matrix multiplication (strange interaction with MKL) #10145
Description
Summary: Multiprocessing slows down completely independent matrix multiplications.
Multiprocessing seems to make completely independent numpy computations slower, even when the number of OpenMP/MKL threads is controlled explicitly.
Test code example:
import time
import multiprocessing as mp
import numpy as np
import mkl
def work(n=5000, tries=5):
    """Compute the product of two random n x n matrices; return the mean timing over [tries] runs."""
    np.random.seed()
    timings = []
    for i in range(tries):
        start = time.time()
        a = np.random.rand(n, n)
        b = np.random.rand(n, n)
        res = np.sum(a.dot(b))
        stop = time.time()
        timings.append(stop - start)
    return np.mean(timings)
num_cores = mp.cpu_count()
print("# Cores: {}".format(num_cores))

# Print table header by hand
tab = "\t\t"
print("procs", end=tab)
for i in range(1, num_cores + 1):
    print("{}".format(i), end=tab)
print("")
print("threads")

# Try all combinations of threads/processes
for threads in range(1, num_cores + 1):
    mkl.set_num_threads(threads)
    print("set {} get {}".format(threads, mkl.get_max_threads()), end=tab)
    for procs in range(1, num_cores + 1):
        pool = mp.Pool(procs)
        jobs = [() for _ in range(procs)]
        print("{:0.4f}".format(np.mean(pool.starmap(work, jobs))), end=tab)
        pool.close()  # release the workers before the next iteration
        pool.join()
    print("")
Expectation: with threads set to 1, running 4 processes on 4 cores should take roughly the same time per process as running 1 process on 1 core.
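As an aside, the thread count can also be capped without the mkl module by setting the standard BLAS environment variables before numpy is first imported; a minimal sketch (the variable names cover the common backends, and which one is honored depends on how numpy was built):

```python
import os

# Cap BLAS threading via environment variables. These must be set
# *before* numpy (and thus MKL/OpenBLAS/OpenMP) is first imported.
for var in ("MKL_NUM_THREADS", "OMP_NUM_THREADS", "OPENBLAS_NUM_THREADS"):
    os.environ[var] = "1"

import numpy as np  # picks up the limits set above

a = np.random.rand(200, 200)
b = np.random.rand(200, 200)
print(a.dot(b).shape)  # each process now uses a single BLAS thread
```

Because the limits are inherited through fork, workers spawned by multiprocessing after this point are capped the same way.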
Result on a laptop with 4 logical cores (Intel i7-6600U: 2 physical cores with hyper-threading) and 16 GB memory:
# Cores: 4
procs 1 2 3 4
threads
set 1 get 1 6.4352 7.9569 11.3162 18.4924
set 2 get 2 4.3469 9.6891 14.3225 18.6104
set 3 get 2 4.6828 8.5717 13.1268 16.9085
set 4 get 2 4.1497 8.6765 13.4662 17.2045
The numbers of processes and cores in use, as observed in htop, agree with the reported settings. For example, with 2 procs and 2 threads, htop shows 2 processes each using ~200% CPU and all 4 cores at ~100% load. Yet even with 1 thread, adding independent processes slows the code down substantially. The computation here is not memory-bound, and there is no communication between processes.
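The claim that the workload is not memory-bound can be backed with a back-of-envelope arithmetic-intensity estimate; a rough sketch (counting only the three dense float64 matrices, ignoring caches and temporaries):

```python
# Arithmetic intensity of the n=5000 matmul in the test above:
# a dense matmul needs ~2*n**3 floating-point operations but only
# touches 3 matrices of n**2 float64 values each.
n = 5000
flops = 2 * n**3                 # multiply-adds for a dense matmul
bytes_moved = 3 * n**2 * 8       # a, b, and the result as float64
intensity = flops / bytes_moved  # FLOPs per byte of unique data
print(flops, bytes_moved, round(intensity, 1))
```

Hundreds of FLOPs per byte is far above the machine balance of any recent CPU, so a tuned BLAS here is limited by arithmetic throughput, not memory bandwidth.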
System info:
$ uname -a
Linux xxx 4.13.12-1-ARCH #1 SMP PREEMPT Wed Nov 8 11:54:06 CET 2017 x86_64 GNU/Linux
$ python --version
Python 3.6.2 :: Anaconda, Inc.
In [9]: numpy.distutils.system_info.get_info("mkl")
Out[9]:
{'define_macros': [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)],
'include_dirs': ['/home/larry/programs/anaconda3/include'],
'libraries': ['mkl_rt', 'pthread'],
'library_dirs': ['/home/larry/programs/anaconda3/lib']}
In [10]: numpy.version.full_version
Out[10]: '1.13.1'
Related reports:
https://stackoverflow.com/questions/15639779/why-does-multiprocessing-use-only-a-single-core-after-i-import-numpy
https://stackoverflow.com/questions/15414027/multiprocessing-pool-makes-numpy-matrix-multiplication-slower
https://stackoverflow.com/questions/26258728/parallel-processing-with-multiprocessing-is-slower-than-sequential
https://stackoverflow.com/questions/47380366/dramatic-slow-down-using-multiprocess-and-numpy-in-python
I tried all the "solutions" mentioned in those reports; none seems relevant here. I can confirm that the processor affinity is correct and all cores are utilized.
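The affinity check can be reproduced with the standard library alone; a minimal sketch (os.sched_getaffinity is Linux-only, hence the fallback, and the pool size of 2 is arbitrary):

```python
import os
import multiprocessing as mp

def report_affinity():
    # Set of CPUs the calling process may run on; a restricted set
    # (e.g. inherited from taskset or a prior affinity fix) would
    # explain idle cores.
    if hasattr(os, "sched_getaffinity"):
        return sorted(os.sched_getaffinity(0))
    return list(range(mp.cpu_count()))  # fallback on non-Linux platforms

if __name__ == "__main__":
    print("parent affinity:", report_affinity())
    with mp.Pool(2) as pool:
        # Each worker reports its own affinity mask.
        print("worker affinities:", pool.starmap(report_affinity, [(), ()]))
```

If every process reports the full set of CPUs, affinity is not the cause of the slowdown.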