
[MRG] Protect against oversubscription with numba prange / or TBB linked native code #951

Merged 8 commits into joblib:master on Oct 25, 2019

Conversation

@ogrisel (Contributor) commented Oct 23, 2019

Add NUMBA_NUM_THREADS and TBB_NUM_THREADS to the list of environment variables.

Not sure if we really need a test for this: it would require adding a test dependency on numba, for instance.
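
For context, here is a minimal sketch of the mechanism this list feeds into (not the actual joblib code, and the helper name is made up): the loky backend exports one capping variable per known native threadpool into each worker's environment.

import os

# Variable list roughly mirroring the diff reviewed below; the real joblib code differs.
MAX_NUM_THREADS_VARS = [
    'OMP_NUM_THREADS', 'OPENBLAS_NUM_THREADS', 'MKL_NUM_THREADS',
    'BLIS_NUM_THREADS', 'VECLIB_MAXIMUM_THREADS', 'NUMEXPR_NUM_THREADS',
    'NUMBA_NUM_THREADS',
]

def inner_threads_env(inner_max_num_threads):
    # Hypothetical helper: build the environment fragment that would be
    # exported to every worker process, skipping variables the user already set.
    return {var: str(inner_max_num_threads)
            for var in MAX_NUM_THREADS_VARS if var not in os.environ}

print(inner_threads_env(1))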

codecov bot commented Oct 23, 2019

Codecov Report

❗ No coverage uploaded for pull request base (master@37dbbdb).
The diff coverage is 100%.


@@            Coverage Diff            @@
##             master     #951   +/-   ##
=========================================
  Coverage          ?   95.46%           
=========================================
  Files             ?       45           
  Lines             ?     6610           
  Branches          ?        0           
=========================================
  Hits              ?     6310           
  Misses            ?      300           
  Partials          ?        0
Impacted Files Coverage Δ
joblib/test/test_parallel.py 96.82% <100%> (ø)
joblib/_parallel_backends.py 96.64% <100%> (ø)

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@ogrisel (Contributor, Author) commented Oct 23, 2019

I have been running manual tests to try to trigger oversubscription with MKL/TBB and loky (without setting TBB_NUM_THREADS), and it seems that TBB is clever enough not to schedule too many tasks. So maybe this is not necessary.

I am not sure that TBB is cgroups-aware, though, so maybe we should not merge this too quickly.

@ogrisel changed the title from "Protect against oversubscription with numba prange / or TBB linked native code" to "[WIP] Protect against oversubscription with numba prange / or TBB linked native code" on Oct 23, 2019
@ogrisel (Contributor, Author) commented Oct 23, 2019

The total number of threads reported by htop is very high, but the wall-clock time of the parallel loop stays low. I suspect that TBB starts large thread pools in each loky worker but is then smart enough not to schedule too many tasks.
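
As a side note, the per-process thread counts seen in htop can also be checked programmatically; here is a small Linux-only sketch (not part of the PR) that reads the Threads: field of /proc/<pid>/status:

def thread_count(pid):
    # Number of OS threads of a process, parsed from /proc/<pid>/status (Linux only).
    with open(f"/proc/{pid}/status") as f:
        for line in f:
            if line.startswith("Threads:"):
                return int(line.split()[1])

# Submitting `lambda: thread_count(os.getpid())` to each loky worker (as in the
# benchmark script in the next comment) reports how many threads TBB started there.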

@ogrisel (Contributor, Author) commented Oct 24, 2019

I have run the following evaluation with loky directly, so the benchmark runs without any oversubscription protection by default:

import numpy as np
import os
from pprint import pprint
from time import time
from loky import ProcessPoolExecutor, cpu_count

data = np.random.randn(1000, 1000)
print(f"one eig, shape={data.shape}:",
      end=" ", flush=True)
tic = time()
np.linalg.eig(data)
print(f"{time() - tic:.3f}s")

e = ProcessPoolExecutor(max_workers=48)
worker_env = e.submit(lambda: os.environ).result()
print("NUM_THREADS env on workers:")
pprint({k: v for k, v in worker_env.items()
        if k.endswith("_NUM_THREADS")})

print(f"warm up numpy on loky workers:",
      end=" ", flush=True)
tic = time()
list(e.map(np.max, range(1000)))
print(f"{time() - tic:.3f}s")

n_iter = cpu_count()
print(f"eig x {n_iter}, shape={data.shape}:",
      end=" ", flush=True)
tic = time()
list(e.map(lambda x: len(np.linalg.eig(x)),
     [data] * n_iter))
print(f"{time() - tic:.3f}s")

Here is the output on a machine with 48 cores (24 physical cores with Hyper-Threading). We try different threading layers and different strategies to protect against oversubscription caused by nested parallelism in the worker processes.

  • Disabling nested parallelism in MKL via the sequential threading layer (reference timing):
$ MKL_THREADING_LAYER=sequential python oversubscribe.py 
one eig, shape=(1000, 1000): 1.542s
NUM_THREADS env on workers:
{}
warm up numpy on loky workers: 0.820s
eig x 48, shape=(1000, 1000): 5.935s
  • Default OpenMP without protection: ~9x slowdown:
$ MKL_THREADING_LAYER=omp python oversubscribe.py 
one eig, shape=(1000, 1000): 1.507s
NUM_THREADS env on workers:
{}
warm up numpy on loky workers: 0.755s
eig x 48, shape=(1000, 1000): 49.486s
  • OpenMP with OMP_NUM_THREADS=1: no slowdown
$ MKL_THREADING_LAYER=omp OMP_NUM_THREADS=1 python oversubscribe.py 
one eig, shape=(1000, 1000): 1.508s
NUM_THREADS env on workers:
{'OMP_NUM_THREADS': '1'}
warm up numpy on loky workers: 0.859s
eig x 48, shape=(1000, 1000): 5.721s
  • OpenMP with MKL_NUM_THREADS=1: no slowdown
$ MKL_THREADING_LAYER=omp MKL_NUM_THREADS=1 python oversubscribe.py 
one eig, shape=(1000, 1000): 1.544s
NUM_THREADS env on workers:
{'MKL_NUM_THREADS': '1'}
warm up numpy on loky workers: 0.837s
eig x 48, shape=(1000, 1000): 5.738s
  • TBB no protection: 7x slowdown
$ MKL_THREADING_LAYER=tbb python oversubscribe.py 
one eig, shape=(1000, 1000): 1.260s
NUM_THREADS env on workers:
{}
warm up numpy on loky workers: 0.893s
eig x 48, shape=(1000, 1000): 41.815s
  • TBB with MKL_NUM_THREADS=1: 7x slowdown
$ MKL_THREADING_LAYER=tbb MKL_NUM_THREADS=1 python oversubscribe.py 
one eig, shape=(1000, 1000): 1.266s
NUM_THREADS env on workers:
{'MKL_NUM_THREADS': '1'}
warm up numpy on loky workers: 0.866s
eig x 48, shape=(1000, 1000): 40.717s
  • TBB with IPC_ENABLE=1: 1.5x slowdown
$ MKL_THREADING_LAYER=tbb IPC_ENABLE=1 python oversubscribe.py 
one eig, shape=(1000, 1000): 1.361s
NUM_THREADS env on workers:
{}
warm up numpy on loky workers: 0.771s
eig x 48, shape=(1000, 1000): 9.249s

So in conclusion:

  • TBB by default does indeed suffer from oversubscription when nested under Python processes;
  • MKL_NUM_THREADS has no effect on TBB, so we cannot use that to protect against over-subscription;
  • there is no TBB_NUM_THREADS variable for TBB (I tried it and still observed the same slowdown);
  • when using TBB IPC scheduler coordination, the oversubscription problem is partially mitigated, but not as well as with OpenMP and OMP_NUM_THREADS=1.

These results reproduce some of the findings published at SciPy 2018 by @anton-malakhov: http://conference.scipy.org/proceedings/scipy2018/pdfs/anton_malakhov.pdf

For joblib, I think we should at least enable IPC coordination between workers by default. We could try to force MKL_THREADING_LAYER=omp with OMP_NUM_THREADS=worker_budget to get maximum performance, but this is probably too magic.
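
To make the OMP_NUM_THREADS=worker_budget idea concrete, here is a minimal sketch of what a user could do around the benchmark script above. This is not joblib's implementation; it assumes the loky workers inherit the parent's environment and only import numpy/MKL once they start running tasks:

import os
from loky import ProcessPoolExecutor, cpu_count

n_workers = 48
# Force the OpenMP threading layer (the only one the benchmark shows to respect
# a per-process thread cap) and give each worker an equal share of the CPUs.
os.environ['MKL_THREADING_LAYER'] = 'omp'
os.environ['OMP_NUM_THREADS'] = str(max(1, cpu_count() // n_workers))

# Workers spawned from here inherit the environment above, so nested
# MKL/OpenMP calls inside each worker stay within their CPU budget.
e = ProcessPoolExecutor(max_workers=n_workers)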

@ogrisel (Contributor, Author) commented Oct 25, 2019

By playing with this on a server with Docker, I also found out that TBB might be subject to over-subscription issues because it is not aware of Linux cgroup CPU quotas: oneapi-src/oneTBB#190.
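
For reference, here is a rough sketch of how such a quota can be detected from inside a container (cgroup v1 paths, as used by Docker at the time; loky's cpu_count does something similar in spirit, but this is not its actual code):

import math
import os

def cgroup_cpu_budget():
    # Effective CPU count under a cgroup v1 CFS quota, or the full count if
    # no quota is set (quota == -1) or the files are missing.
    try:
        with open('/sys/fs/cgroup/cpu/cpu.cfs_quota_us') as f:
            quota = int(f.read())
        with open('/sys/fs/cgroup/cpu/cpu.cfs_period_us') as f:
            period = int(f.read())
    except OSError:
        return os.cpu_count()
    if quota <= 0:
        return os.cpu_count()
    return max(1, math.ceil(quota / period))

print(cgroup_cpu_budget())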

@ogrisel changed the title from "[WIP] Protect against oversubscription with numba prange / or TBB linked native code" to "[MRG] Protect against oversubscription with numba prange / or TBB linked native code" on Oct 25, 2019
@ogrisel merged commit e441eec into joblib:master on Oct 25, 2019
@ogrisel deleted the tbb-oversubscription branch on October 25, 2019 at 13:35
@@ -40,9 +40,12 @@ def __init__(self, nesting_level=None, inner_max_num_threads=None):

 MAX_NUM_THREADS_VARS = [
     'OMP_NUM_THREADS', 'OPENBLAS_NUM_THREADS', 'MKL_NUM_THREADS',
-    'BLIS_NUM_THREADS', 'VECLIB_MAXIMUM_THREADS', 'NUMEXPR_NUM_THREADS'
+    'BLIS_NUM_THREADS', 'VECLIB_MAXIMUM_THREADS', 'TBB_NUM_THREADS',
@anton-malakhov commented Oct 25, 2019

Is TBB_NUM_THREADS a kind of dummy variable? There is no such control variable in TBB itself. @ogrisel
I'd suggest adding a comment that documents its internal usage.

@ogrisel (Contributor, Author) replied

Oops, indeed, this is a leftover I wanted to remove. Thanks for the catch.

@anton-malakhov commented

@ogrisel Thanks for reproducing the results! I'm glad it helped. I'm actually rather surprised by this:

> For joblib, I think we should at least enable IPC coordination between workers by default. We could try to force MKL_THREADING_LAYER=omp with OMP_NUM_THREADS=worker_budget to get maximum performance, but this is probably too magic.

If the IPC way is the default, it means we finally have customers and can start improving this mechanism!

netbsd-srcmastr pushed a commit to NetBSD/pkgsrc that referenced this pull request Dec 14, 2019
Release 0.14.1

Configure the loky workers' environment to mitigate oversubscription with nested multi-threaded code in the following cases:

allow for a suitable number of threads for numba (NUMBA_NUM_THREADS);
enable Interprocess Communication for scheduler coordination when the nested code uses Threading Building Blocks (TBB) (ENABLE_IPC=1).
joblib/joblib#951

Fix a regression where the loky backend was not reusing previously spawned workers. joblib/joblib#968

Revert joblib/joblib#847 to avoid using pkg_resources, which introduced a performance regression under Windows: joblib/joblib#965
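
A quick way to check that a given installation applies this configuration (assuming joblib >= 0.14.1 with the default loky backend) is to print the relevant variables from inside a worker; a minimal sketch:

import os
from joblib import Parallel, delayed

def relevant_env():
    # Variables the 0.14.1 release notes say are configured in the workers.
    return {k: v for k, v in os.environ.items()
            if k.endswith('_NUM_THREADS') or k == 'ENABLE_IPC'}

# Each worker is expected to report NUMBA_NUM_THREADS (among others) and ENABLE_IPC=1.
print(Parallel(n_jobs=2)(delayed(relevant_env)() for _ in range(2)))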
@ogrisel (Contributor, Author) commented Jan 14, 2020

> If the IPC way is the default, it means we finally have customers and can start improving this mechanism!

TBB with IPC is now enabled by default for worker processes spawned by joblib.

Unfortunately, TBB work scheduled from the parent process will not coordinate unless the user sets the environment variable prior to launching the parent Python process.
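
To make that caveat concrete, here is a tiny hedged sketch a driver script could use to warn about it; the variable name is taken from the benchmark commands above (IPC_ENABLE=1), and the need to export it before launching Python is exactly the requirement described here:

import os

# The workers get the variable from joblib, but the parent process only
# coordinates if it was launched with the variable already exported, e.g.
#   IPC_ENABLE=1 python driver.py
if 'IPC_ENABLE' not in os.environ:
    print('warning: TBB work in the parent process will not use IPC coordination')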

@ogrisel (Contributor, Author) commented Jan 14, 2020

@anton-malakhov I would love to get your feedback on oneapi-src/oneTBB#190 BTW :)

@anton-malakhov commented

Please allow me some time, I'm on a revitalizing vacation now.
