
Set number of threads after numpy import #11826

Closed
paugier opened this issue Aug 28, 2018 · 48 comments
@paugier commented Aug 28, 2018

This is not a bug report, just an enhancement proposal.

I think it would be useful and important to be able to easily set the number of threads used by NumPy after NumPy has been imported.

From the perspective of library developers, it is often useful to be able to control the number of threads used by Numpy, see for example biopython/biopython#1401.

It is not difficult to do in a simple script, when we are sure that NumPy or SciPy have not been imported previously, with something like:

    import os
    os.environ["OMP_NUM_THREADS"] = "1"  # must be set before the first NumPy import
    import numpy as np

However, in a library there is a good chance that the user has already imported NumPy in their main script, with something like

    import numpy as np

    import fluidimage

In this case, I don't see how to set the number of threads used by Numpy from the fluidimage code.

Thus, it would be very convenient to have a function np.set_num_threads.

@mattip (Member) commented Aug 28, 2018

We would have to make this pluggable somehow to adapt to different linalg implementations. For instance, OpenBLAS seems to expose an openblas_set_num_threads, but MKL would use some other function.

@paugier (Author) commented Aug 29, 2018

MKL has a function void mkl_set_num_threads(int nt);

https://software.intel.com/en-us/mkl-developer-reference-c-mkl-set-num-threads

which can be called through https://pypi.org/project/mkl/

But it seems to me it would be cleaner to have a generic Numpy function for this.

Most Numpy users don't even know which linalg library is used in the background.

@touqir14 (Contributor)
If it's alright, I would like to work on this. I haven't contributed to NumPy before, so I will have to get a bit familiar with the codebase first. Hence, it would be great if I could get some direction.

@mattip (Member) commented Sep 15, 2018

@touqir14 see the developer documentation. You should write tests that exercise the new functions, verifying that they indeed set the desired number of threads (perhaps by writing a C-level tests-only function that calls NPY_BEGIN_THREADS, ends with NPY_END_THREADS, and in the middle somehow counts the number of threads started), and then write code to implement the functions.

@juliantaylor (Contributor) commented Sep 16, 2018

Note that this is actually quite tricky to tackle.
NumPy does not actually know which BLAS library implements these functions; it just assumes that whatever provides them uses the standard BLAS APIs and ABIs. The underlying implementation can also be switched without recompiling NumPy (thanks to ABI compatibility; this is used, e.g., on Debian-based systems).

As there is unfortunately no standard API for querying the provider, NumPy would be reliant on runtime introspection to determine the actual provider of the functions. As there are not very many of them (the ones I recall are OpenBLAS, ATLAS, BLIS, MKL and the reference BLAS), it might be possible, but still difficult to get working portably.

@touqir14 (Contributor) commented Sep 16, 2018 via email

@FabianIsensee commented Sep 26, 2018

This is an extremely annoying problem, because I want to explicitly turn off multithreading in my background workers, and as far as I can see there is no way of properly doing that (except downgrading numpy to < 1.14). OMP_NUM_THREADS=1 will solve my problem for the most part, but it will prevent my entire python process, including the main thread, from using OpenMP multithreading, while I would like to disable it only in the background workers. As of now, I have 8 background workers (multiprocessing.Process), each of which will spawn another 8 threads to do np.dot computations (which are only a very small part of what they are actually doing). That clogs up the entire CPU. Any suggestions on how I could solve this?

Edit: downgrading to numpy 1.14.5 solves the problem. Starting from numpy 1.14.6 it's there.

@mattip (Member) commented Sep 26, 2018

@FabianIsensee The problem is that NumPy does not currently know which linalg backend you are using: OpenBLAS, MKL, ATLAS, ... Did you see the stackoverflow question mentioned above, which suggests functions you can wrap and call to control this for your backend?

@FabianIsensee

@mattip Thanks for pointing that out! That is a good solution to the problem, provided that the system you are running on uses OpenBLAS. Unfortunately this solution has no effect on my system. What's especially annoying for me is that I am providing an open source framework that runs into this problem, and I cannot know which BLAS library each and every one of the users is going to use.
I am not familiar with the numpy releases, but the multi-threaded matmul appears starting from version 1.14.6. I checked the release notes and could not spot anything that would have enabled that (or am I blind?). Since multithreading of these operations did not happen prior to that, there must be a way to control it from within numpy (none of my libraries changed).

@charris (Member) commented Sep 27, 2018

The change in 1.14.6 was building manylinux1 against OpenBLAS instead of ATLAS, we were already using OpenBLAS for Windows. As far as the build goes, 1.14.6 is pretty much 1.15.x. It sounds like we need to find a solution to this that doesn't depend on the user knowing which library is in use, probably adding some function in numpy that stores its info during the build.
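As an aside, NumPy already records some build-time configuration that can hint at the linked BLAS. A rough, hedged sketch of inspecting it via np.show_config() (the printed format varies across NumPy versions, so the substring matching below is an assumption, not a reliable detection mechanism):

```python
import io
from contextlib import redirect_stdout

import numpy as np

def blas_build_hint():
    """Capture np.show_config() output and scan it for known BLAS vendor names."""
    buf = io.StringIO()
    with redirect_stdout(buf):
        np.show_config()  # prints build configuration to stdout
    text = buf.getvalue().lower()
    for vendor in ("openblas", "mkl", "blis", "atlas", "accelerate"):
        if vendor in text:
            return vendor
    return "unknown"
```

This only reports what NumPy was built against, not which library is actually loaded at runtime (which, as noted above, can differ on systems that swap BLAS implementations behind a stable ABI).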

@FabianIsensee

Thank you for the clarification!
In an ideal world I would like to either specifically enable/disable multithreading by setting some property after importing numpy or even be able to set the number of threads to use manually. Would something like this be possible?

@charris (Member) commented Sep 27, 2018

Would something like this be possible?

I don't know, but it sounds like we need it. If nothing else we can pass info upstream and try to encourage the libraries themselves to add such a feature, ideally as some sort of standard library interface in BLAS (LAPACK) itself. @njsmith IIRC, you reported that there was some work going on to produce a new standard? In any case, @matthew-brett I think we could have some effect on OpenBLAS. Maybe there is already such a feature.

@charris (Member) commented Sep 27, 2018

Although with a single dynamic library I don't see how one could coordinate between callers. Hmm, not a simple problem, almost something that needs to be handled at the OS level. This is getting beyond my expertise.

@mattip (Member) commented Sep 27, 2018

@FabianIsensee this seems troubling

Unfortunately this solution has no effect on my system

What exactly did you try and what was the result?

Edit: formatting

@FabianIsensee

@mattip I did exactly what was described in the comment you referenced above.

    # num_threads is the ctypes-based context manager from the comment
    # linked above (it wraps openblas_set_num_threads)
    with num_threads(1):
        import numpy as np
        a = np.random.random((20000000, 3))
        b = np.random.random((3, 3))
        for _ in range(10):
            c = np.dot(a, b)

    import numpy as np
    a = np.random.random((20000000, 3))
    b = np.random.random((3, 3))
    for _ in range(10):
        c = np.dot(a, b)

These are the two examples I compared. I ran them while looking at htop to see CPU usage. For both of them a number of threads was spawned and CPU usage was above 100% for the main thread.

Running it as OMP_NUM_THREADS=1 python playground/numpy_threading_problem.py gives the intended behavior on my computer, but this may not transfer to other machines.

@stuarteberg (Contributor) commented Oct 19, 2018

Whether or not a numpy API is feasible for this feature, perhaps we can crowdsource a new section in the numpy docs to explain this issue and offer advice with respect to the different environment variables.

IIUC, the basic advice for OpenBLAS and MKL users would be: If you're planning to use multiple processes (e.g. via multiprocessing, dask, pyspark, etc.), set OMP_NUM_THREADS=1 before running your program. If not, you can leave it unset. (It would be nice to explain why this is important, too.)
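That advice can be sketched as follows. The helper names here are illustrative, and the "spawn" start method is assumed so that each worker re-imports numpy after the environment variable is set (with "fork", a numpy already imported in the parent would carry its thread pool into the children):

```python
import multiprocessing as mp
import os

def init_worker():
    # Runs in each child before any task; with "spawn", numpy has not
    # been imported in the child yet, so the variable takes effect.
    os.environ["OMP_NUM_THREADS"] = "1"

def task(n):
    import numpy as np  # imported after OMP_NUM_THREADS is set
    a = np.ones((n, n))
    return float((a @ a)[0, 0])  # for ones(n, n), each entry of a @ a is n

def run_pool():
    ctx = mp.get_context("spawn")
    with ctx.Pool(processes=2, initializer=init_worker) as pool:
        return pool.map(task, [64, 128])
```

Calling run_pool() from a script's `if __name__ == "__main__":` block should return [64.0, 128.0], with each worker's BLAS calls staying single-threaded.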

Beyond that, maybe an MKL expert can offer more fine-grained advice with respect to these variables:
https://software.intel.com/en-us/mkl-linux-developer-guide-intel-mkl-specific-environment-variables-for-openmp-threading-control

For ATLAS users, apparently the number of threads is predetermined at compile time:
http://math-atlas.sourceforge.net/faq.html#tnum

@charris wrote:

Although with a single dynamic library I don't see how one could coordinate between callers.

I don't quite grok this point, which is all the more reason I would love to see some docs on this general topic. In which section of the numpy docs should I start a PR on the topic of multithreading control?

@bbbbbbbbba

IIUC, the basic advice for OpenBLAS and MKL users would be: If you're planning to use multiple processes (e.g. via multiprocessing, dask, pyspark, etc.), set OMP_NUM_THREADS=1 before running your program. If not, you can leave it unset. (It would be nice to explain why this is important, too.)

Correct me if I'm wrong, but it's not a big problem with multiprocessing, right? At worst you pay some extra overhead for parallelizing more than you have CPU cores, but each process spawns its own threads and works correctly. It's multithreading that is really problematic and causes bugs like #11046.

@FabianIsensee commented Oct 30, 2018

I would very much disagree here. In the specific situation that I am in, I have a pool of background workers (multiprocessing.Process) that generate batches for a deep learning algorithm. These batches contain images, some of which need to be (among other things) rotated for data augmentation (which is implemented via matrix multiplication of image coordinates). The machine I am working on is a DGX-1 with 8 graphics cards and 80 CPU threads. Usually I train 8 different networks on it simultaneously, using 10 workers and 1 GPU each. Now each of these workers (80 in total) will attempt to do these matrix multiplications (which are quite tiny, by the way) in a multithreaded way, and since each worker sees 80 CPUs, they will spawn 80 threads each, resulting in the system being completely clogged up. That effectively breaks everything for me, and the only way I can continue my work is to downgrade to numpy 1.14.5.

@stuarteberg (Contributor)

Correct me if I'm wrong, but it's not a big problem with multiprocessing, right? At worst you pay some extra overhead for parallelizing more than you have CPU cores

I think we're on the same page, @bbbbbbbbba. In the case of multiprocessing, I'm not worried about incorrect results. As you said, I'm worried about poor performance due to spawning more threads than you can schedule onto your CPU cores.

But the penalty is not trivial! In one of my recent use-cases on a 16-core machine, the performance was 6x worse due to the extra threads. On an 80-core machine like @FabianIsensee's, the overhead must be even worse.

The threading-related issue you referred to is troubling, but that sounds like an outright bug, not an issue with the OMP_NUM_THREADS setting. Hopefully someone will eventually implement a fix in OpenBLAS or a workaround in numpy.

@mattip (Member) commented Nov 1, 2018

This code (MIT licensed, and based on other BSD-licensed routines) probes the loaded DLLs (shared objects) at runtime to find which implementation is relevant, and calls the implementation-specific routine to set the number of threads. Thanks to @ogrisel for this comment pointing it out.

@ogrisel (Contributor) commented Nov 1, 2018

This code (MIT licensed, and based on other BSD-licensed routines) probes the loaded DLLs (shared objects) at runtime to find which implementation is relevant, and calls the implementation-specific routine to set the number of threads. Thanks to @ogrisel for this comment pointing it out.

Sorry for being late to the party, I had not seen this issue. Indeed, we started to investigate over-subscription issues a bit in the context of scikit-learn / joblib, but this is still work in progress. Having numpy expose a uniform API to control the behavior of the underlying BLAS thread pool would be nice. Note that @anton-malakhov and @tomMoral are the primary authors of that dynamic ctypes-based access to the underlying runtime libraries.

Ping @jeremiedbb who might also be interested in following this discussion.

@seberg

This comment has been minimized.

@seberg

This comment has been minimized.

@mattip (Member) commented Nov 5, 2018

@seberg the script I linked to uses ctypes and OS-provided functions to walk down the loaded shared objects looking for the one we want. Isn't that what your script does, only using Popen(["ldd", ...])? It seems that if the C library provides dl_iterate_phdr (https://linux.die.net/man/3/dl_iterate_phdr) for Linux, _dyld_image_count for macOS and GetModuleFileNameExW for Windows, we should use them.

@seberg

This comment has been minimized.

@seberg (Member) commented Nov 6, 2018

Ah sorry, forget about the ldd stuff, I just added it because I kept looking at it. No, the first function just loads the multiarray.so with ctypes and checks if certain function symbols are defined... That seems to work for OpenBLAS, MKL, BLIS and ATLAS. But I have no idea whether just trying to load function symbols should work, or whether e.g. Accelerate is identifiable by the existence of such a symbol.

EDIT: OK, never mind my rambling. Tried on Windows, and the stuff probably just randomly works on Linux.
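The symbol probing described above might look roughly like this. The symbol-to-vendor mapping is an assumption collected from this thread (and, as noted, Accelerate is not covered):

```python
import ctypes

# Each vendor's thread-control / build-info entry point, used here only
# as a fingerprint for which BLAS a shared object provides. These names
# are assumptions; verify them against each library's headers.
_VENDOR_SYMBOLS = [
    ("openblas_set_num_threads", "openblas"),
    ("MKL_Set_Num_Threads", "mkl"),
    ("bli_thread_set_num_threads", "blis"),
    ("ATL_buildinfo", "atlas"),  # ATLAS fixes its thread count at compile time
]

def guess_blas_vendor(lib):
    """lib: a ctypes.CDLL handle. Return a vendor name, or None if unrecognized."""
    for symbol, vendor in _VENDOR_SYMBOLS:
        # hasattr triggers a symbol lookup; a missing symbol raises
        # AttributeError internally, so hasattr returns False.
        if hasattr(lib, symbol):
            return vendor
    return None
```

As the comment above says, this "randomly works" on ELF platforms where the BLAS symbols end up globally visible; on Windows a different mechanism (e.g. enumerating loaded modules) would be needed.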

@mattip (Member) commented May 26, 2019

Can we mark this as Closed? Maybe we should pivot it to "document use of threadpoolctl to control threads"?

@mattip (Member) commented Aug 18, 2019

Closing, since the threadpoolctl package handles this.
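For reference, a minimal threadpoolctl usage sketch (assuming the package is installed; see its README for the full API):

```python
import numpy as np
from threadpoolctl import threadpool_limits

a = np.random.random((2000, 2000))

with threadpool_limits(limits=1, user_api="blas"):
    # Every BLAS call inside this block runs single-threaded, whichever
    # of OpenBLAS / MKL / BLIS numpy happens to be linked against.
    b = a @ a
```

threadpoolctl performs exactly the kind of runtime introspection of loaded shared libraries discussed in this thread, which is why closing in its favor made sense.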

@mattip closed this as completed Aug 18, 2019