Poor performance of MPI in v3.0 #464
Comments
Hmm, interesting. EDIT: can you give me the following information?
My guess is that numpy multithreads BLAS differently than v2.1's Eigen, which could lead to weird interplay with MPI on the same system. If you could also test 1, 2, and 4 cores without SR, it would be helpful to understand where something is going wrong.
I think I now understand the problem. There is multithreading happening automatically in v3.0 (OpenMP, I guess?), which is not happening in v2.1. When I deactivate multithreading, the performance is about a factor of six better than in v2.1 on the example above. So there is actually a significant performance boost in v3.0. Nice! It would be really helpful to have some documentation about the parallelization for us plebeians.
Yes and no: we don't have and never have had explicit multithreading (except for a few RBM kernels in v2.1), but BLAS does multithread linear algebra operations on sufficiently big arrays. Numpy (v3.0) is very aggressive in this multithreading (I guess due to the GIL forcing single-threaded execution), and its threshold is lower than C++'s Eigen (v2.1). Also, Eigen's multithreading is automatically disabled on some systems and requires an ENV variable to activate.
If you want to disable this, I'm sure you can do it from numpy; check this comment (numpy/numpy#11826) or the threadpoolctl package. About documenting this... yes, you are right, but documenting takes time. (And I'd like to argue that scientific-computing users should be aware of the implicit threading done by BLAS libraries in general. This is not a netket-specific issue, but, yeah... most people don't know, and learn about it when they hit an issue like this.)
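For reference, a minimal sketch of the two approaches mentioned above (environment variables and the threadpoolctl package); which environment variable actually matters depends on the BLAS backend numpy was built against:

```python
import os

# Environment variables must be set *before* numpy is imported,
# otherwise the BLAS thread pool is already initialised.
os.environ["OMP_NUM_THREADS"] = "1"        # OpenMP-based BLAS builds
os.environ["OPENBLAS_NUM_THREADS"] = "1"   # OpenBLAS
os.environ["MKL_NUM_THREADS"] = "1"        # Intel MKL

import numpy as np

# Alternatively, limit the threads at runtime with threadpoolctl:
from threadpoolctl import threadpool_limits

a = np.random.rand(2000, 2000)
with threadpool_limits(limits=1, user_api="blas"):
    # All BLAS calls inside this block run single-threaded.
    b = a @ a
```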
I think that what we should do in this case is just make sure that only the root node solves the linear system when using the full solver; then multithreading shouldn't be an issue.
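A minimal sketch of what that could look like with mpi4py (the names S and f for the SR matrix and gradient are placeholders, not NetKet's actual internals):

```python
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

n = 256  # placeholder for the number of variational parameters

# Assume S and f have already been reduced (averaged) onto the root rank.
S = np.eye(n) if rank == 0 else None
f = np.ones(n) if rank == 0 else None

if rank == 0:
    # Only the root solves the dense (possibly BLAS-multithreaded) system.
    dp = np.linalg.solve(S, f)
else:
    dp = np.empty(n, dtype=np.float64)

# Broadcast the resulting parameter update to every rank.
comm.Bcast(dp, root=0)
```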
Are you sure this will be an improvement?
If running across N MPI processes with M total samples, every process will be solving a system where one dimension is M/N, so it will be a smaller one.
And how do you account for the case where you are not multithreading, but running several processes under Slurm, where each process has access to only a few physical CPU cores?
(Besides, locally, I still think you might hit a performance degradation due to interference between BLAS threading and MPI processes, which aren't aware of each other.)
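As an aside, a hypothetical way to avoid oversubscription in that Slurm scenario is to cap each rank's BLAS threads at the number of cores it has actually been assigned, e.g.:

```python
import os

# On Linux, sched_getaffinity reports the cores this process may run on
# (e.g. as pinned by Slurm or the MPI launcher).
cores_for_this_rank = len(os.sched_getaffinity(0))

# Cap the BLAS thread pools accordingly, *before* importing numpy,
# so that MPI ranks don't compete for each other's cores.
os.environ["OMP_NUM_THREADS"] = str(cores_for_this_rank)
os.environ["OPENBLAS_NUM_THREADS"] = str(cores_for_this_rank)
os.environ["MKL_NUM_THREADS"] = str(cores_for_this_rank)

import numpy as np  # noqa: E402
```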
Is this the case now in the
The question is whether solving the N systems with fewer samples separately and averaging the solutions is equivalent (in the presence of numerical error) to solving the full system after computing the joint S matrix and gradient for all samples, or whether this risks introducing another source of error. Intuitively, I think this should not be the case for sufficiently many samples, in which case your suggestion seems clearly preferable to only working on the root node. Maybe we should test this for v3.
Since we're discussing parallel performance: in the version 3 Python code, is the MCMC part (i.e., the computation of samples using
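To illustrate the equivalence question above, here is a toy numpy sketch (not NetKet code) comparing a single solve of the jointly estimated system against the average of per-block solutions; O and e stand in for per-sample log-derivatives and local-energy fluctuations:

```python
import numpy as np

rng = np.random.default_rng(0)
n_params, n_samples, n_blocks = 10, 4000, 4

# Toy per-sample "log-derivative" matrix and "force" contributions.
O = rng.normal(size=(n_samples, n_params))
e = rng.normal(size=n_samples)

def solve_block(O_k, e_k):
    # Build the block's S matrix and gradient, then solve S_k dp_k = f_k.
    S_k = O_k.T @ O_k / len(e_k)
    f_k = O_k.T @ e_k / len(e_k)
    return np.linalg.solve(S_k, f_k)

# (a) joint solve: S and f estimated from all samples, one linear solve.
dp_joint = solve_block(O, e)

# (b) average of the per-block solutions.
blocks = np.array_split(np.arange(n_samples), n_blocks)
dp_avg = np.mean([solve_block(O[b], e[b]) for b in blocks], axis=0)

# The two generally differ; the gap shrinks as the samples per block grow.
print(np.linalg.norm(dp_joint - dp_avg) / np.linalg.norm(dp_joint))
```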
I'm also having problems with parallelisation. I'm not explicitly doing anything with MPI, but currently I am using optuna to find hyperparameters of the wavefunction. Optuna makes it possible to generate hyperparameter proposals and run them in parallel, but it seems that every generated job also performs parallel computations, thus requiring more resources than the computer actually has. It would be good to have something that limits the number of cores. I tried with threadpoolctl, but since it acts on a job belonging to a pool of jobs, it gets confused (as stated in their issues).
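For reference, a minimal sketch (with a hypothetical run_training helper) of capping BLAS threads around each trial with threadpoolctl; note that the limits are process-wide, which is presumably why it gets confused when trials run as threads within a single pool:

```python
import optuna
from threadpoolctl import threadpool_limits

def run_training(alpha):
    # Hypothetical helper: build the machine/sampler with this hyperparameter,
    # run the optimisation, and return the final energy (dummy value here).
    return float(alpha)

def objective(trial):
    alpha = trial.suggest_int("alpha", 1, 4)
    # Process-wide cap on BLAS threads while this trial's linear algebra runs;
    # the previous limits are restored when the block exits.
    with threadpool_limits(limits=1, user_api="blas"):
        return run_training(alpha)

study = optuna.create_study()
# n_jobs > 1 runs trials concurrently; each should then stay single-threaded.
study.optimize(objective, n_trials=20, n_jobs=4)
```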
What are you using, numpy, jax, or torch?
I'm using a jax machine and a jax sampler. If I don't get it to work, I'll post details in a couple of days.
As I said above, Jax does some parallelization on its own. Check its docs for what ENV variables to set.
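For what it's worth, the recipe that circulates in the jax issue tracker for forcing single-threaded CPU execution looks roughly like this; the exact flag names may differ between jax/XLA versions, so treat them as an assumption to verify:

```python
import os

# Must be set before jax is imported; commonly suggested in jax issues for
# single-threaded CPU execution (flag names may change with the XLA version).
os.environ["XLA_FLAGS"] = (
    "--xla_cpu_multi_thread_eigen=false intra_op_parallelism_threads=1"
)
os.environ["OMP_NUM_THREADS"] = "1"

import jax.numpy as jnp  # noqa: E402

x = jnp.ones((1000, 1000))
print((x @ x).sum())
```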
OK, I had to use a somewhat ugly workaround, but it works (unfortunately, setting some env variables didn't completely work).
RECAP: this is a non-issue. Maybe we should document the fact that jax and numpy do parallelisation on their own and how to disable it, but that is also somewhat known stuff.
I have run a number of experiments comparing the performance of NK2.1 and NK3.0 on multiple cores using MPI. I see very inconsistent performance for NK3.0, which leads me to question whether MPI is implemented effectively (optimally) in v3.0. Below is an example experiment:
NetKet 2.1, heisenberg1d.py, L=60, alpha=3, symmetries=True, samples=2000:
1 core (no MPI): 24 s/it
9 cores: 3 s/it
18 cores: 1.5 s/it
36 cores: 2.3 s/it
It seems 18 cores reaches a minimum. This is also the case in all the other experiments I ran, varying #samples and system size (#variables). Similar performance is seen for conv networks.
For NetKet 3.0 (without jax) on the exact same system, I get:
1 core (no MPI): 6 s/it
8 cores: 10 s/it
18 cores: 20 s/it
I have not found any scenario in v3.0 where multicore mode (MPI) helps. (Note: the convolutional NN examples do not run in 3.0; the layer function returns an error.)
In all experiments, I am using
without modifying anything in the library except the parameters for the experiments. Using mpiexec does not change anything. Any ideas what is happening?