[RFC] Make n_chains set the total number of chains across all MPI processes #706

PhilipVinc · 2021-05-12T11:36:03Z

Recently we had several people confused by the fact that MPI does not particularly improve performance.
There are two issues:

1) They don't read the documentation (and we don't have a page on MPI)
2) Our n_chains is a rank-local property, so if you increase the number of MPI ranks you get more chains. However, the number of samples is kept fixed.

Point 1) can be solved with better docs.

Point 2) is about an inconsistency with the way we set n_samples. I propose to change the behaviour of n_chains so that it sets the number of chains globally according to the formula

n_chains_per_rank = max(int(np.ceil(n_chains / mpi.n_nodes)), 1)
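As a rough illustration of the formula above (not the PR's actual code), the sketch below uses plain Python with math.ceil in place of np.ceil and takes the number of ranks as an argument instead of reading mpi.n_nodes. The function name chains_per_rank is made up for this example.

```python
import math


def chains_per_rank(n_chains: int, n_ranks: int) -> int:
    """Split a global chain count over ranks, rounding up, with at least one chain per rank."""
    return max(math.ceil(n_chains / n_ranks), 1)


# Compare the requested global count with what the ranks actually run in total.
for n_chains, n_ranks in [(16, 1), (16, 4), (10, 4), (16, 1000)]:
    per_rank = chains_per_rank(n_chains, n_ranks)
    print(f"n_chains={n_chains}, ranks={n_ranks}: "
          f"{per_rank} per rank, {per_rank * n_ranks} in total")
```

Because of the rounding up, 10 chains on 4 ranks become 3 per rank, so 12 chains run overall rather than the 10 requested.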

One can still specify n_chains_per_rank if they so desire.

This is just a skeleton implementation (though it should mostly work).
As fixing tests everywhere to use n_chains_per_rank instead of n_chains will take some time, I'll finish this PR only if we get consensus on this.

Note that it will be a fairly breaking change in the behaviour (though it won't technically break code)

github-actions · 2021-05-12T11:38:42Z

Hello and thanks for your Contribution!
I will be building previews of the updated documentation at the following link:
https://netket.github.io/netket/preview/pv/n_chains

Once the PR is closed or merged, the preview will be automatically deleted.

femtobit

I'm very much in favor of this change. n_samples and n_chains should either both be rank-local or both global (and the former option is just confusing).

netket/sampler/base.py

PhilipVinc · 2021-05-14T10:38:56Z

Pff.
This plays very badly with Flax struct/dataclass.
I think we should roll our own dataclass.
It should not be much work.
Maybe I'll do this at some point.

PhilipVinc · 2021-05-17T13:10:25Z

This is now rebased on top of #716.
It works.
So if we merge #716 we can have this.

codecov-commenter · 2021-05-17T13:12:18Z

Codecov Report

Merging #706 (0e1a85f) into master (5340864) will decrease coverage by 0.33%.
The diff coverage is 54.38%.

@@            Coverage Diff             @@
##           master     #706      +/-   ##
==========================================
- Coverage   69.47%   69.13%   -0.34%     
==========================================
  Files         216      216              
  Lines       12480    12509      +29     
  Branches     1809     1817       +8     
==========================================
- Hits         8670     8648      -22     
- Misses       3335     3378      +43     
- Partials      475      483       +8

Impacted Files                          Coverage Δ
netket/sampler/exact.py                 85.71% <ø> (ø)
netket/variational/mc_mixed_state.py    87.93% <ø> (ø)
netket/sampler/metropolis_pmap.py       48.75% <8.33%> (+0.60%)
netket/variational/mc_state.py          81.01% <60.00%> (-1.38%)
netket/sampler/base.py                  77.45% <65.38%> (-3.51%)
netket/sampler/metropolis.py            84.37% <100.00%> (ø)
netket/utils/__init__.py                100.00% <100.00%> (ø)
netket/utils/mpi/mpi.py                 53.48% <0.00%> (-30.24%)
netket/legacy/stats/_sum_inplace.py     52.94% <0.00%> (-21.57%)
netket/utils/mpi/primitives.py          35.82% <0.00%> (-13.44%)
... and 6 more

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

netket/sampler/base.py

PhilipVinc · 2021-05-20T09:44:40Z

@gcarleo Do we want to do this?

PhilipVinc · 2021-05-20T09:46:15Z

What this does, in the end, is that samplers can be built with

sa = nk.sampler.MetropolisLocal(n_chains=X)

and when run under MPI, every rank will use

n_chains_per_rank = max(int(np.ceil(n_chains / mpi.n_nodes)), 1)

or you can create them with

sa = nk.sampler.MetropolisLocal(n_chains_per_rank=X)

which matches the current behaviour.

Without MPI, nothing changes.
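To make the two ways of building a sampler concrete, here is a minimal sketch of how the two keyword arguments could be resolved. It does not use NetKet: resolve_chains is a hypothetical helper, the rank count is passed in explicitly, and rejecting calls that give both (or neither) argument is an assumption of this example rather than something decided in this thread.

```python
import math
from typing import Optional, Tuple


def resolve_chains(
    n_ranks: int,
    n_chains: Optional[int] = None,
    n_chains_per_rank: Optional[int] = None,
) -> Tuple[int, int]:
    """Return (chains per rank, total chains) from exactly one of the two settings."""
    # Assumption of this sketch: exactly one of the two arguments must be given.
    if (n_chains is None) == (n_chains_per_rank is None):
        raise ValueError("specify exactly one of n_chains or n_chains_per_rank")
    if n_chains_per_rank is None:
        # Global setting: divide over the ranks, rounding up, at least one per rank.
        n_chains_per_rank = max(math.ceil(n_chains / n_ranks), 1)
    return n_chains_per_rank, n_chains_per_rank * n_ranks


print(resolve_chains(4, n_chains=16))           # global count shared by 4 ranks
print(resolve_chains(4, n_chains_per_rank=16))  # rank-local count, as before this PR
print(resolve_chains(1, n_chains=16))           # a single process
```

With a single process both spellings give the same result, which is why nothing changes outside MPI.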

gcarleo · 2021-05-20T09:48:48Z

Yes, I am just worried that if one leaves n_chains=16 and runs on 1000 MPI ranks, they might be really surprised by the new behavior...

PhilipVinc · 2021-05-20T09:50:51Z

What will happen in this case is 1 chain per rank + warning saying that 1*1000 != 16

The cleanest way to do this is usually to make n_chains an error and have n_chains_per_rank and n_chains_total, and then in a future release deprecate n_chains_total and go back to n_chains.

But that would break everything for people running stuff locally.

gcarleo · 2021-05-20T09:53:00Z

yeah I mean, I think this change is consistent with the fact that n_samples for us is really the total number of samples, not the n_samples_per_rank (btw, that might make sense too...). I am not against merging this actually

PhilipVinc · 2021-05-20T09:56:15Z

n_samples_per_rank I can add that in this PR.

I think a good alternative would be to print a warning always when running under MPI with n_chains for this release (warning can be disabled with a flag) saying that the behaviour changed.
Then we get rid of the warning in the next release (cc @femtobit)
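For the n_samples_per_rank idea, the relation to the global n_samples could look like the sketch below. This is only an assumption about the intended semantics (samples spread evenly over ranks, rounding up when the total does not divide), and both function names are hypothetical.

```python
import math


def n_samples_total(n_samples_per_rank: int, n_ranks: int) -> int:
    """Global sample count when every rank draws the same number of samples."""
    return n_samples_per_rank * n_ranks


def n_samples_per_rank_from_total(n_samples: int, n_ranks: int) -> int:
    """Per-rank share of a global sample count, rounded up (assumed rounding rule)."""
    return math.ceil(n_samples / n_ranks)


print(n_samples_total(250, 4))
print(n_samples_per_rank_from_total(1000, 4))
```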

femtobit

> n_samples_per_rank I can add that in this PR.

Yes, that would be nice to have for consistency. Either way, I'm happy with this PR.

> I think a good alternative would be to print a warning always when running under MPI with n_chains for this release (warning can be disabled with a flag) saying that the behaviour changed.
> Then we get rid of the warning in the next release (cc @femtobit)

Maybe... It'd be a warning that is displayed essentially every time NetKet is run, so it'd be pretty prominent (which is good to get people to notice, but can also be annoying - the flag helps but needs to be specified all the time). I'm undecided, feel free to do what you think is best.

PhilipVinc · 2021-05-27T08:36:11Z

So if @gcarleo agrees I'll add n_samples_per_rank to MCVariationalState, and change the behaviour so that:

1) what n_chains does today becomes n_chains_per_rank
2) n_chains will now set the global number of chains.

If n_chains is not perfectly divisible by the number of ranks we print a warning, only on rank 0.
I know it's annoying but I think this is the correct thing to do.
Regardless, when you run stuff under MPI you already have some visual noise so I think this is not so bad.
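A possible shape for that check is sketched below, assuming the rank index and rank count are passed in (a real implementation would get them from the MPI layer). The function name and the warning text are invented for the example.

```python
import math
import warnings


def chains_per_rank_checked(n_chains: int, n_ranks: int, rank: int) -> int:
    """Compute the per-rank chain count; warn on rank 0 only if the total changes."""
    per_rank = max(math.ceil(n_chains / n_ranks), 1)
    total = per_rank * n_ranks
    if total != n_chains and rank == 0:
        warnings.warn(
            f"n_chains={n_chains} cannot be split evenly over {n_ranks} ranks; "
            f"using {per_rank} chain(s) per rank, {total} in total.",
            stacklevel=2,
        )
    return per_rank


# The case raised earlier in the thread: 16 chains requested on 1000 ranks.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    for rank in range(3):  # pretend to be ranks 0, 1 and 2
        chains_per_rank_checked(16, 1000, rank)
    print(f"{len(caught)} warning(s) raised:", caught[0].message)
```

Only the call with rank 0 emits the warning, so the message shows up once per run instead of once per process.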

gcarleo · 2021-05-27T08:46:28Z

Ok yes please add n_samples_per_rank and change n_chains accordingly, this looks like a good solution


femtobit reviewed May 12, 2021

PhilipVinc mentioned this pull request May 17, 2021

NetKet Dataclass #716

Merged

PhilipVinc force-pushed the pv/n_chains branch from 7dfdb61 to 9ef97f5 Compare May 17, 2021 13:03

femtobit reviewed May 17, 2021

PhilipVinc force-pushed the pv/n_chains branch from a633988 to 627c3ea Compare May 18, 2021 08:48

PhilipVinc marked this pull request as ready for review May 20, 2021 09:44

femtobit approved these changes May 27, 2021

PhilipVinc added 3 commits May 27, 2021 10:53

use global n_chains

fda8b64

fixup dtype fixup! impro Update netket/sampler/base.py Co-authored-by: Damian Hofmann <femtobit@users.noreply.github.com> black

fixes

54df195

changelog

0e1a85f

PhilipVinc force-pushed the pv/n_chains branch from f44a747 to 0e1a85f Compare May 27, 2021 10:28

PhilipVinc merged commit 1c5a48b into master May 27, 2021

PhilipVinc deleted the pv/n_chains branch May 27, 2021 12:18

PhilipVinc mentioned this pull request Jun 6, 2021

Rename MCState n_discard to n_discard_per_chain #739

Merged
