
Multi-threading aware multi-threading #16990

Closed
orenbenkiki opened this issue Aug 3, 2020 · 10 comments

@orenbenkiki

If I invoke np.corrcoef on a large enough matrix, it uses multiple threads to speed up the computation. In fact, it uses all the CPUs (nproc). On my server, this is a lot (56 threads).

However, if I am using multiple threads myself, and invoke np.corrcoef in each one (again, on a large enough matrix), then each invocation uses its own set of nproc threads.

The result is that the OS can see up to nproc^2 threads (in my case, over 3,000 threads!), with all the memory and scheduling overhead this entails (most likely the process just gets killed due to out-of-memory issues).

One would expect Numpy itself to be multi-threading aware: it should use at most nproc additional internal threads in total, regardless of the number of application threads that invoke Numpy.

One could say this isn't Numpy's issue, but an issue of the underlying parallel framework (OpenMP, TBB). That said, TBB is supposed to solve this problem in theory, but I still see it when running Intel's Python distribution with python -m tbb. Either "using TBB doesn't mean what you think it means" or there's some deeper issue here.

Reproducing the problem:

Run np.corrcoef on a large matrix from nproc different Python threads, and watch the system's load average soar towards nproc^2 (most likely the program will be killed before it gets to that point).
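
A minimal sketch of that reproducer (the matrix size and the use of ThreadPoolExecutor are illustrative assumptions, not from the report above):

```python
# Sketch: one Python thread per CPU, each calling np.corrcoef on a large
# matrix. With a multi-threaded BLAS, each call may spin up its own pool
# of ~nproc threads, so the OS can end up juggling ~nproc^2 threads.
import os
from concurrent.futures import ThreadPoolExecutor

import numpy as np

nproc = os.cpu_count() or 1

def work(seed):
    rng = np.random.default_rng(seed)
    data = rng.standard_normal((2000, 2000))  # "large enough" is illustrative
    return np.corrcoef(data)

with ThreadPoolExecutor(max_workers=nproc) as pool:
    results = list(pool.map(work, range(nproc)))
```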

@eric-wieser
Member

Does this reproduce with the much more elementary np.dot?

@orenbenkiki
Author

Yes.

@eric-wieser
Member

Then it sounds like this is completely out of our control, as our implementation for your test is likely just:

return cblas_matrixproduct(typenum, ap1, ap2, out);

@rgommers
Member

rgommers commented Aug 3, 2020

I guess it depends on your own threading implementation. The answer for NumPy at the moment is: use the threadpoolctl package. See gh-11826 for more details.
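
For reference, basic threadpoolctl usage looks roughly like this (a sketch; the limit of 1 is just an example value):

```python
# Sketch: cap the BLAS thread pools around a region of code, so the
# application can manage parallelism at a higher level itself.
import numpy as np
from threadpoolctl import threadpool_limits

a = np.random.rand(4000, 4000)

with threadpool_limits(limits=1, user_api="blas"):
    # Inside this block, BLAS calls made by NumPy use a single thread.
    c = np.corrcoef(a)
```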

@rgommers rgommers closed this as completed Aug 3, 2020
@orenbenkiki
Author

@rgommers: Thanks for the pointer to threadpoolctl.

Two notes though:

  1. Given that something like threadpoolctl exists, expecting Numpy to take on the responsibility of avoiding thread oversubscription, at least as an option, seems like a less unreasonable request.

    Until such time as Numpy is extended to do that, I'm adding threadpoolctl calls to my code to battle my oversubscription problem.

  2. The way I read it, the threadpoolctl.threadpool_limits function is global. This does solve the problem well if one has a single top-level parallel loop. It doesn't solve it that well if one has nested levels of parallelism, with a low branching factor in each one.

    For example, consider a recursive divide-and-conquer that only splits into two parallel sub-tasks at each step. If one wants to multiply matrices in each step (even just in the "leaf" steps), then it is tricky to use threadpoolctl to control oversubscription while still allowing for efficient parallelism (see the sketch after this list).

    That said, this is a threadpoolctl issue more than a Numpy issue.
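
To make the nested-parallelism concern in point 2 concrete, here is a hedged sketch of that divide-and-conquer shape; the thread-budget bookkeeping is an illustrative workaround, not something threadpoolctl provides:

```python
# Sketch of the divide-and-conquer pattern from point 2: each level splits
# into two parallel sub-tasks, and the BLAS-heavy work happens at the leaves.
# Note that threadpool_limits() is process-wide rather than per-thread, so
# concurrently running leaves that set different limits step on each other,
# which is exactly why a single global limit is awkward here.
import os
from concurrent.futures import ThreadPoolExecutor

import numpy as np
from threadpoolctl import threadpool_limits

NPROC = os.cpu_count() or 1

def solve(blocks, budget=NPROC):
    if len(blocks) == 1 or budget <= 1:
        # Leaf step: ideally this BLAS call would get exactly `budget`
        # threads, but the limit below applies to the whole process.
        with threadpool_limits(limits=max(budget, 1), user_api="blas"):
            return [np.corrcoef(block) for block in blocks]
    mid = len(blocks) // 2
    with ThreadPoolExecutor(max_workers=2) as pool:
        # Split both the work and the thread budget between the two halves.
        left = pool.submit(solve, blocks[:mid], budget // 2)
        right = pool.submit(solve, blocks[mid:], budget - budget // 2)
        return left.result() + right.result()
```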

@rgommers
Member

rgommers commented Aug 3, 2020

It's not a threadpoolctl issue either; it's a multi-package coordination problem that is very hard to solve in a generic way, in particular when one doesn't have control over a whole stack of packages (like wheels on PyPI that don't know of one another). Read e.g. Composable Multi-Threading and Multi-Processing for Numeric Libraries from people at Intel.

@orenbenkiki
Author

Fair point - yes, this is hard. BTW, is there any known way to get "composable multi-threading" with Numpy?

For example, even using Intel's Python distribution (which presumably uses MKL and TBB), and even when running with python -m tbb --ipc as specified in the paper, I still see oversubscription :-(

FWIW, tbb is listed as one of the thread pool APIs by threadpool_info, together with openmp. Do I need to specify some additional flags/options to force Numpy to use TBB instead of OMP in such a setup?

@rgommers
Member

rgommers commented Aug 3, 2020

I don't know; NumPy uses neither, it's all MKL at that point. You should look at what libraries are loaded; perhaps you're using the wrong API for threading, or perhaps it's an Intel packaging problem.
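
Looking at what is loaded can be done with threadpoolctl itself; a quick sketch (the exact fields reported depend on the threadpoolctl version and the BLAS in use):

```python
# Sketch: list the thread pools threadpoolctl can see in this process.
# Entries typically report the user_api (e.g. "blas", "openmp"), the
# internal library (e.g. MKL, OpenBLAS) and its current thread count.
from pprint import pprint

import numpy as np  # importing NumPy loads its BLAS library
from threadpoolctl import threadpool_info

pprint(threadpool_info())
```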

@orenbenkiki
Author

I have contacted the relevant Intel maintainers. It seems that currently the TBB implementation has some issues with the ThreadPoolExecutor, so the issue isn't that Numpy doesn't use it, it is that it doesn't work as expected.

The good news is that, if you use the Intel Python distribution and run python -m smp --kmp-composability ... (that is, "Composable OMP"), then the system will only use the available number of hardware threads, regardless of how many threads are issuing Numpy commands.

I am not aware of anything equivalent using the vanilla Python distribution, though. It seems the smp package is only functional in the Intel distribution. Also, I had to patch it, as it was using internal APIs of concurrent.futures which have changed by adding positional arguments with no default values. So either way, it is not a "proper solution" yet.

Bottom line, I'm keeping my code using the threadpoolctl workaround as the default, with an option to disable it if one is lucky enough to have a system which knows to self-limit the number of used threads. Presumably the default behavior would become safe to use in a few years, a decade at most ;-)
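
A rough sketch of that default-on workaround (the maybe_limit_blas helper and the LIMIT_BLAS_THREADS switch are made-up names for illustration):

```python
# Sketch: limit BLAS threads by default, with an opt-out for systems that
# already know how to self-limit. LIMIT_BLAS_THREADS is a hypothetical
# environment variable invented for this example.
import os
from contextlib import nullcontext

import numpy as np
from threadpoolctl import threadpool_limits

def maybe_limit_blas(limits=1):
    """Return a thread-limiting context unless the user opted out."""
    if os.environ.get("LIMIT_BLAS_THREADS", "1") == "0":
        return nullcontext()
    return threadpool_limits(limits=limits, user_api="blas")

def work(seed):
    rng = np.random.default_rng(seed)
    data = rng.standard_normal((2000, 2000))
    with maybe_limit_blas():
        return np.corrcoef(data)
```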

@rgommers
Member

rgommers commented Aug 4, 2020

Thanks, that's very useful info.
