
Multi-threading aware multi-threading #16990

Closed
orenbenkiki opened this issue Aug 3, 2020 · 10 comments

@orenbenkiki

If I invoke np.corrcoef on a large enough matrix, it uses multiple threads to speed up the computation. In fact, it uses all the CPUs (nproc). On my server, this is a lot (56 threads).

However, if I am using multiple threads myself, and invoke np.corrcoef in each one (again, on a large enough matrix), then each invocation uses its own set of nproc threads.

The result is that the OS can see up to nproc^2 threads (in my case, over 3,000 threads!), with all the memory and scheduling overhead this entails (most likely the process just gets killed due to out-of-memory issues).

One would expect Numpy itself to be multi-threading aware: it should use at most nproc additional internal threads in total, regardless of the number of application threads that invoke Numpy.

One could say this isn't Numpy's issue, but an issue of the underlying parallel framework (OpenMP, TBB). That said, TBB is supposed to solve this problem in theory, but I still see it when running Intel's Python distribution with python -m tbb. Either "using TBB doesn't mean what you think it means" or there's some deeper issue here.

Reproducing the problem:

Run np.corrcoef on a large matrix from nproc different Python threads, and watch the system's load average soar towards nproc^2 (most likely the program will be killed before it gets to that point).
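
A minimal sketch of that reproducer (the matrix size and the use of ThreadPoolExecutor are illustrative assumptions, not from the report above):

```python
# Sketch: one Python thread per CPU, each calling np.corrcoef on a large
# matrix. With a multi-threaded BLAS, each call may spin up its own pool
# of ~nproc threads, so the OS can end up juggling ~nproc^2 threads.
import os
from concurrent.futures import ThreadPoolExecutor

import numpy as np

nproc = os.cpu_count() or 1

def work(seed):
    rng = np.random.default_rng(seed)
    data = rng.standard_normal((2000, 2000))  # "large enough" is illustrative
    return np.corrcoef(data)

with ThreadPoolExecutor(max_workers=nproc) as pool:
    results = list(pool.map(work, range(nproc)))
```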

@eric-wieser
Member

Does this reproduce with the much more elementary np.dot?

@orenbenkiki
Author

Yes.

@eric-wieser
Member

Then it sounds like this is completely out of our control, as our implementation for your test is likely just:

return cblas_matrixproduct(typenum, ap1, ap2, out);

@rgommers
Member

rgommers commented Aug 3, 2020

I guess it depends on your own threading implementation. The answer for NumPy at the moment is: use the threadpoolctl package. See gh-11826 for more details.
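
For reference, basic threadpoolctl usage looks roughly like this (a sketch; the limit of 1 is just an example value):

```python
# Sketch: cap the BLAS thread pools around a region of code, so the
# application can manage parallelism at a higher level itself.
import numpy as np
from threadpoolctl import threadpool_limits

a = np.random.rand(4000, 4000)

with threadpool_limits(limits=1, user_api="blas"):
    # Inside this block, BLAS calls made by NumPy use a single thread.
    c = np.corrcoef(a)
```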

@rgommers rgommers closed this as completed Aug 3, 2020
@orenbenkiki
Author

@rgommers: Thanks for the pointer to threadpoolctl.

Two notes though:

  1. Given that something like threadpoolctl exists, expecting Numpy to take on the responsibility of avoiding thread oversubscription, at least as an option, seems like a less unreasonable request.

    Until such time as Numpy is extended to do that, I'm adding threadpoolctl calls to my code to battle my oversubscription problem.

  2. The way I read it, the threadpoolctl.threadpool_limits function is global. This does solve the problem well if one has a single top-level parallel loop. It doesn't solve it that well if one has nested levels of parallelism, with a low branching factor in each one.

    For example, consider a recursive divide-and-conquer that only splits into two parallel sub-tasks at each step. If one wants to multiply matrices in each step (even just in the "leaf" steps), then it is tricky to use threadpoolctl to control oversubscription while still allowing for efficient parallelism (see the sketch after this list).

    That said, this is a threadpoolctl issue more than a Numpy issue.
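
To make the nested-parallelism concern in point 2 concrete, here is a hedged sketch of that divide-and-conquer shape; the thread-budget bookkeeping is an illustrative workaround, not something threadpoolctl provides:

```python
# Sketch of the divide-and-conquer pattern from point 2: each level splits
# into two parallel sub-tasks, and the BLAS-heavy work happens at the leaves.
# Note that threadpool_limits() is process-wide rather than per-thread, so
# concurrently running leaves that set different limits step on each other,
# which is exactly why a single global limit is awkward here.
import os
from concurrent.futures import ThreadPoolExecutor

import numpy as np
from threadpoolctl import threadpool_limits

NPROC = os.cpu_count() or 1

def solve(blocks, budget=NPROC):
    if len(blocks) == 1 or budget <= 1:
        # Leaf step: ideally this BLAS call would get exactly `budget`
        # threads, but the limit below applies to the whole process.
        with threadpool_limits(limits=max(budget, 1), user_api="blas"):
            return [np.corrcoef(block) for block in blocks]
    mid = len(blocks) // 2
    with ThreadPoolExecutor(max_workers=2) as pool:
        # Split both the work and the thread budget between the two halves.
        left = pool.submit(solve, blocks[:mid], budget // 2)
        right = pool.submit(solve, blocks[mid:], budget - budget // 2)
        return left.result() + right.result()
```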

@rgommers
Member

rgommers commented Aug 3, 2020

It's not a threadpoolctl issue either; it's a multi-package coordination problem that is very hard to solve in a generic way, in particular when one doesn't have control over a whole stack of packages (like wheels on PyPI that don't know of one another). Read e.g. Composable Multi-Threading and Multi-Processing for Numeric Libraries from people at Intel.

@orenbenkiki
Author

Fair point - yes, this is hard. BTW, is there any known way to get "composable multi-threading" with Numpy?

For example, even using Intel's Python distribution (which presumably uses MKL and TBB), and even when running with python -m tbb --ipc as specified in the paper, I still see oversubscription :-(

FWIW, tbb is listed as one of the thread pool APIs by threadpool_info, together with openmp. Do I need to specify some additional flags/options to force Numpy to use TBB instead of OMP in such a setup?

@rgommers
Member

rgommers commented Aug 3, 2020

I don't know; NumPy uses neither, it's all MKL at that point. You should look at what libraries are loaded; perhaps you're using the wrong API for threading, or perhaps it's an Intel packaging problem.
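
Looking at what is loaded can be done with threadpoolctl itself; a quick sketch (the exact fields reported depend on the threadpoolctl version and the BLAS in use):

```python
# Sketch: list the thread pools threadpoolctl can see in this process.
# Entries typically report the user_api (e.g. "blas", "openmp"), the
# internal library (e.g. MKL, OpenBLAS) and its current thread count.
from pprint import pprint

import numpy as np  # importing NumPy loads its BLAS library
from threadpoolctl import threadpool_info

pprint(threadpool_info())
```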

@orenbenkiki
Author

I have contacted the relevant Intel maintainers. It seems that currently the TBB implementation has some issues with the ThreadPoolExecutor, so the issue isn't that Numpy doesn't use it, it is that it doesn't work as expected.

The good news is that, if you use the Intel Python distribution and run python -m smp --kmp-composability ... (that is, "Composable OMP"), then the system will only use the available number of hardware threads, regardless of how many threads are issuing Numpy commands.

I am not aware of anything equivalent using the vanilla Python distribution, though. It seems the smp package is only functional in the Intel distribution. Also, I had to patch it, as it was using internal APIs of concurrent.futures which have changed by adding positional arguments with no default values. So either way, it is not a "proper solution" yet.

Bottom line, I'm keeping my code using the threadpoolctl workaround as the default, with an option to disable it if one is lucky enough to have a system which knows to self-limit the number of used threads. Presumably the default behavior would become safe to use in a few years, a decade at most ;-)
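
A rough sketch of that default-on workaround (the maybe_limit_blas helper and the LIMIT_BLAS_THREADS switch are made-up names for illustration):

```python
# Sketch: limit BLAS threads by default, with an opt-out for systems that
# already know how to self-limit. LIMIT_BLAS_THREADS is a hypothetical
# environment variable invented for this example.
import os
from contextlib import nullcontext

import numpy as np
from threadpoolctl import threadpool_limits

def maybe_limit_blas(limits=1):
    """Return a thread-limiting context unless the user opted out."""
    if os.environ.get("LIMIT_BLAS_THREADS", "1") == "0":
        return nullcontext()
    return threadpool_limits(limits=limits, user_api="blas")

def work(seed):
    rng = np.random.default_rng(seed)
    data = rng.standard_normal((2000, 2000))
    with maybe_limit_blas():
        return np.corrcoef(data)
```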

@rgommers
Member

rgommers commented Aug 4, 2020

Thanks, that's very useful info.
