Multiprocessing slows down matrix multiplication (strange interaction with MKL) #10145

@lzlarryli

Description

Summary: Multiprocessing slows down completely independent matrix multiplications.

It seems multiprocessing makes completely independent numpy computations slower, even when the number of OpenMP threads is controlled.
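For completeness, the thread count can also be pinned via environment variables before NumPy is first imported, which then applies in every worker process as well (the standard MKL/OpenMP knobs; shown as a sketch, this is not part of the test below, which uses `mkl.set_num_threads` instead):

```python
import os

# These must be set before numpy (and thus MKL) is imported for the
# first time; forked worker processes inherit the setting.
os.environ["MKL_NUM_THREADS"] = "1"
os.environ["OMP_NUM_THREADS"] = "1"

import numpy as np  # MKL now initializes with a single thread
```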

Test code example:

import time
import multiprocessing as mp

import numpy as np
import mkl

def work(n=5000, tries=5):
    """Compute the product of two random nxn matrices. Return the average timing of [tries] runs."""
    np.random.seed()
    timings = []
    for i in range(tries):
        start = time.time()
        a = np.random.rand(n, n)
        b = np.random.rand(n, n)
        res = np.sum(a.dot(b))
        stop = time.time()
        timings.append(stop - start)
    return np.mean(timings)


num_cores = mp.cpu_count()
print("# Cores: {}".format(num_cores))

# Print table header by hand
tab = "\t\t"
print("procs", end=tab)
for i in range(1, num_cores + 1):
    print("{}".format(i), end=tab)
print("")
print("threads")

# Try all combinations of threads/processes
for threads in range(1, num_cores + 1):
    mkl.set_num_threads(threads)
    print("set {} get {}".format(threads, mkl.get_max_threads()), end=tab)

    for procs in range(1, num_cores + 1):
        # Use exactly `procs` workers and shut the pool down between runs,
        # so idle workers from earlier iterations cannot skew the timings.
        with mp.Pool(procs) as pool:
            jobs = [() for _ in range(procs)]
            print("{:0.4f}".format(np.mean(pool.starmap(work, jobs))), end=tab)
    print("")

Expectation: with the thread count set to 1, running 4 processes on 4 cores should take roughly the same time as running 1 process on 1 core.

Result on a laptop with 4 logical cores (Intel i7-6600U, 2 physical cores with hyper-threading) and 16 GB memory:

# Cores: 4
procs           1               2               3               4
threads
set 1 get 1     6.4352          7.9569          11.3162         18.4924
set 2 get 2     4.3469          9.6891          14.3225         18.6104
set 3 get 2     4.6828          8.5717          13.1268         16.9085
set 4 get 2     4.1497          8.6765          13.4662         17.2045

The observed number of processes and cores used (from htop) agrees with the reported numbers. For example, for 2 procs and 2 threads, htop reports 2 processes using ~200% of CPU and all 4 cores have ~100% load. Yet even with 1 thread, adding independent processes slows down the code substantially. The computation here is not memory bound and there is no communication between processes.
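One way to narrow this down (a diagnostic sketch, not from the original report) is to compare each worker's wall time against its CPU time: if wall time grows while CPU time stays flat, the processes are being descheduled; if CPU time itself grows with the process count, the slowdown is inside the computation, e.g. memory or cache contention between the kernels.

```python
import multiprocessing as mp
import time

import numpy as np

def timed_work(n=1000):
    """Time one n x n matmul in the current process.

    Returns (wall_seconds, cpu_seconds). process_time() sums CPU time
    over all threads of the process, so it also exposes MKL using more
    threads than requested.
    """
    a = np.random.rand(n, n)
    b = np.random.rand(n, n)
    wall0, cpu0 = time.perf_counter(), time.process_time()
    a.dot(b)
    return time.perf_counter() - wall0, time.process_time() - cpu0

if __name__ == "__main__":
    for procs in (1, 2, 4):
        with mp.Pool(procs) as pool:
            results = pool.starmap(timed_work, [()] * procs)
        wall = np.mean([w for w, _ in results])
        cpu = np.mean([c for _, c in results])
        print("procs={} wall={:.3f}s cpu={:.3f}s".format(procs, wall, cpu))
```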

System info:

$ uname -a
Linux xxx 4.13.12-1-ARCH #1 SMP PREEMPT Wed Nov 8 11:54:06 CET 2017 x86_64 GNU/Linux

$ python --version
Python 3.6.2 :: Anaconda, Inc.



In [9]: numpy.distutils.system_info.get_info("mkl")
Out[9]:
{'define_macros': [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)],
 'include_dirs': ['/home/larry/programs/anaconda3/include'],
 'libraries': ['mkl_rt', 'pthread'],
 'library_dirs': ['/home/larry/programs/anaconda3/lib']}
In [10]: numpy.version.full_version
Out[10]: '1.13.1'

Related reports:

https://stackoverflow.com/questions/15639779/why-does-multiprocessing-use-only-a-single-core-after-i-import-numpy
https://stackoverflow.com/questions/15414027/multiprocessing-pool-makes-numpy-matrix-multiplication-slower
https://stackoverflow.com/questions/26258728/parallel-processing-with-multiprocessing-is-slower-than-sequential
https://stackoverflow.com/questions/47380366/dramatic-slow-down-using-multiprocess-and-numpy-in-python

I tried all the "solutions" mentioned above; none seems relevant here. I can confirm that the CPU affinity is correct and all cores are utilized.
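For reference, the affinity claim can be checked from Python itself (Linux-only `os.sched_getaffinity`; a small sketch, not part of the original report):

```python
import os

# CPUs the current process is allowed to run on (Linux only).
allowed = os.sched_getaffinity(0)
print("affinity: {} of {} CPUs".format(len(allowed), os.cpu_count()))
```

If `len(allowed)` is smaller than `os.cpu_count()`, an import (e.g. of a BLAS library) has restricted the affinity mask, which is the failure mode described in the first Stack Overflow link above.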
