[MRG] Protect against oversubscription with numba prange / or TBB linked native code #951
Conversation
Codecov Report
@@ Coverage Diff @@
## master #951 +/- ##
=========================================
Coverage ? 95.46%
=========================================
Files ? 45
Lines ? 6610
Branches ? 0
=========================================
Hits ? 6310
Misses ? 300
Partials ? 0
Continue to review full report at Codecov.
I have been running manual testing to try to trigger oversubscription with MKL/TBB and loky (without setting TBB_NUM_THREADS), and it seems that TBB is clever enough not to schedule too many tasks. So maybe this is not necessary. I am not sure that TBB is cgroups-aware, though, so maybe we should not merge this too quickly.

The total number of threads in htop is very high, but the wall-clock time of the parallel loop stays very good. I suspect that TBB starts large thread pools in each loky worker but is then smart enough not to schedule too many tasks afterwards.
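A quick way to confirm this from the parent process, rather than eyeballing htop, is to count the threads of each worker with psutil. This is only a minimal sketch; it assumes psutil is installed and that the workers are child processes of the current one:

```python
# Sketch: count the threads of every child (loky worker) of this process.
# Assumes psutil is installed; run this while the parallel loop is active.
import psutil

for worker in psutil.Process().children(recursive=True):
    print(f"pid={worker.pid}: {worker.num_threads()} threads")
```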
I have run the following evaluation with loky directly, to benchmark without any oversubscription protection enabled by default:

```python
import numpy as np
import os
from pprint import pprint
from time import time
from loky import ProcessPoolExecutor, cpu_count

data = np.random.randn(1000, 1000)

# Baseline: a single eig call in the parent process.
print(f"one eig, shape={data.shape}:", end=" ", flush=True)
tic = time()
np.linalg.eig(data)
print(f"{time() - tic:.3f}s")

e = ProcessPoolExecutor(max_workers=48)

# Inspect which *_NUM_THREADS variables the workers actually see.
worker_env = e.submit(lambda: os.environ).result()
print("NUM_THREADS env on workers:")
pprint({k: v for k, v in worker_env.items()
        if k.endswith("_NUM_THREADS")})

# Warm up the worker processes (imports, numpy initialization).
print("warm up numpy on loky workers:", end=" ", flush=True)
tic = time()
list(e.map(np.max, range(1000)))
print(f"{time() - tic:.3f}s")

# One eig call per core, dispatched to the workers.
n_iter = cpu_count()
print(f"eig x {n_iter}, shape={data.shape}:", end=" ", flush=True)
tic = time()
list(e.map(lambda x: len(np.linalg.eig(x)),
           [data] * n_iter))
print(f"{time() - tic:.3f}s")
```

Here is the output on a machine with 48 cores (24 physical cores with Hyper-Threading). We try different threading layers and different kinds of strategies to protect against oversubscription caused by nested parallelism in the worker processes.
So in conclusion:
Those results reproduce some of the results published at SciPy 2018 by @anton-malakhov: http://conference.scipy.org/proceedings/scipy2018/pdfs/anton_malakhov.pdf

For joblib, I think we should at least enable IPC coordination between workers by default. We could try to force …
By playing with this on a server with Docker, I also found out that TBB might be subject to over-subscription issues caused by its lack of awareness of Linux cgroup CPU quotas: oneapi-src/oneTBB#190.
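For readers who have not seen the IPC mode mentioned above: it is enabled through an environment variable (ENABLE_IPC=1, the name used in the 0.14.1 changelog quoted later in this thread) that has to be present in the workers' environment. A minimal sketch, assuming the variable is simply inherited from the parent process:

```python
# Sketch: enable TBB's interprocess (IPC) scheduler coordination in the
# loky workers by setting ENABLE_IPC=1 before they are spawned.
# Whether this has any effect depends on how TBB was built/linked.
import os
from loky import ProcessPoolExecutor

os.environ["ENABLE_IPC"] = "1"      # must be set before the workers start
executor = ProcessPoolExecutor(max_workers=48)
```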
```diff
@@ -40,9 +40,12 @@ def __init__(self, nesting_level=None, inner_max_num_threads=None):
 MAX_NUM_THREADS_VARS = [
     'OMP_NUM_THREADS', 'OPENBLAS_NUM_THREADS', 'MKL_NUM_THREADS',
-    'BLIS_NUM_THREADS', 'VECLIB_MAXIMUM_THREADS', 'NUMEXPR_NUM_THREADS'
+    'BLIS_NUM_THREADS', 'VECLIB_MAXIMUM_THREADS', 'TBB_NUM_THREADS',
```
Is TBB_NUM_THREADS a kind of dummy variable? There is no such control variable in TBB itself. @ogrisel, I'd suggest adding a comment that documents its internal usage.
Oops, indeed, this is a leftover I wanted to remove. Thanks for the catch.
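For context, each entry in MAX_NUM_THREADS_VARS is turned into an environment variable for the worker processes so that nested native thread pools stay within the per-worker budget. A simplified sketch of that mechanism (not joblib's actual implementation):

```python
# Simplified sketch (not joblib's actual code): build the env-var overrides
# that cap nested BLAS/OpenMP thread pools inside each worker process.
import os

MAX_NUM_THREADS_VARS = [
    'OMP_NUM_THREADS', 'OPENBLAS_NUM_THREADS', 'MKL_NUM_THREADS',
    'BLIS_NUM_THREADS', 'VECLIB_MAXIMUM_THREADS', 'NUMEXPR_NUM_THREADS',
]


def inner_thread_env(n_workers, total_cpus=None):
    """Give each worker an equal share of the CPUs for its nested pools."""
    total_cpus = total_cpus or os.cpu_count()
    inner = max(total_cpus // n_workers, 1)
    return {var: str(inner) for var in MAX_NUM_THREADS_VARS}


print(inner_thread_env(n_workers=4))
```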
@ogrisel Thanks for reproducing the results! I'm glad it helped. I'm actually rather surprised by this:

If the IPC way is the default, it means we finally have customers and can start improving this mechanism!
Release 0.14.1

- Configure the loky workers' environment to mitigate oversubscription with nested multi-threaded code in the following cases: allow for a suitable number of threads for numba (NUMBA_NUM_THREADS); enable Interprocess Communication for scheduler coordination when the nested code uses Threading Building Blocks (TBB) (ENABLE_IPC=1). joblib/joblib#951
- Fix a regression where the loky backend was not reusing previously spawned workers. joblib/joblib#968
- Revert joblib/joblib#847 to avoid using pkg_resources, which introduced a performance regression under Windows. joblib/joblib#965
TBB with IPC is now enabled by default for worker processes spawned by joblib. Unfortunately, TBB work scheduled from the parent process will not coordinate unless the user sets the environment variable prior to launching the parent Python process.
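In practice this means the variable is best exported before Python starts; setting it at the very top of the script, before any TBB-backed library is imported, is only a best-effort fallback. A sketch, using the ENABLE_IPC name from the changelog above:

```python
# Sketch: per the comment above, the reliable option is to export the
# variable in the shell before launching Python, e.g.: ENABLE_IPC=1 python script.py
# Setting it at the very top of the script, before any TBB-backed library
# (MKL, numba, ...) initialises its scheduler, may work but is not guaranteed.
import os
os.environ.setdefault("ENABLE_IPC", "1")

import numpy as np  # TBB-backed imports only after the environment is set
```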
@anton-malakhov I would love to get your feedback on oneapi-src/oneTBB#190 BTW :)
Please allow me some time, I'm on a revitalizational vacation now.
Add NUMBA_NUM_THREADS and TBB_NUM_THREADS to the list of environment variables.
Not sure if we really need a test for this: it would require adding a test dependency on numba, for instance.
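If a test were added despite the extra dependency, one option is a hypothetical sketch along these lines: skip when numba is missing and only assert on the environment forwarded to the workers (the names below are illustrative, not part of this PR):

```python
# Hypothetical test sketch (not part of this PR): check that the loky
# workers receive a NUMBA_NUM_THREADS setting; skip if numba is missing.
import pytest

pytest.importorskip("numba")


def get_worker_env():
    # Executed inside a worker process.
    import os
    return dict(os.environ)


def test_numba_num_threads_forwarded_to_workers():
    from joblib import Parallel, delayed

    worker_envs = Parallel(n_jobs=2, backend="loky")(
        delayed(get_worker_env)() for _ in range(2)
    )
    for env in worker_envs:
        assert "NUMBA_NUM_THREADS" in env
```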