
Parallelize working on Ensemble normalization #94

Merged: fzeiser merged 3 commits into master from dev/parallel on Jan 16, 2020

Conversation

@fzeiser (Collaborator) commented Jan 9, 2020

Used pathos multiprocessing because of a problem with pickling when using the "normal" multiprocessing module.
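For illustration, a minimal sketch (not the ompy code itself) of the kind of call that the standard multiprocessing pickler rejects but pathos handles, since pathos serializes with dill:

from pathos.multiprocessing import ProcessingPool

# A lambda (or a locally defined function, or a method of a class defined in a
# notebook) cannot be serialized by the standard pickle module, so
# multiprocessing.Pool.map would fail here; pathos uses dill, which can handle it.
square = lambda x: x * x

pool = ProcessingPool(nodes=2)
results = pool.map(square, range(8))
pool.close()
pool.join()
print(results)  # [0, 1, 4, 9, 16, 25, 36, 49]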

I also had a look at #67, the parallelization of the ensemble creation. However, to do this we would have to rewrite the function somewhat more, since currently we want to have access to several arrays from all workers:

ompy/ompy/ensemble.py, lines 204 to 221 in f4ccc09:

raw_ensemble = np.zeros((number, *self.raw.shape))
unfolded_ensemble = np.zeros_like(raw_ensemble)
firstgen_ensemble = np.zeros_like(raw_ensemble)
for step in tqdm(range(number)):
    LOG.info(f"Generating {step}")
    if self.bg is not None:
        prompt_w_bg = self.generate_perturbed(step, method,
                                              state="prompt+bg")
        bg = self.generate_perturbed(step, method, state="bg")
        raw = self.subtract_bg(step, prompt_w_bg, bg)
    else:
        raw = self.generate_perturbed(step, method, state="raw")
    unfolded = self.unfold(step, raw)
    firstgen = self.first_generation(step, unfolded)
    raw_ensemble[step, :, :] = raw.values
    unfolded_ensemble[step, :, :] = unfolded.values

Which version do you think is preferable? The numpy shared-array approach might be faster, but I think it turns out to be less readable, especially for someone who doesn't know this feature too well. We'd get three blocks like this (ok, maybe it's not so terrible ;P, but it was easier to read before):

result = np.ctypeslib.as_ctypes(np.zeros((size, size)))
shared_array = sharedctypes.RawArray(result._type_, result)
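For context, a rough sketch of how a worker would then write into the shared buffer (illustrative only, using a made-up fill_row helper):

import numpy as np
from multiprocessing import sharedctypes

size = 4
result = np.ctypeslib.as_ctypes(np.zeros((size, size)))
shared_array = sharedctypes.RawArray(result._type_, result)

def fill_row(i):
    # Re-wrap the shared ctypes buffer as a numpy view inside the worker
    # and write the row in place; nothing needs to be returned.
    view = np.ctypeslib.as_array(shared_array)
    view[i, :] = i

for i in range(size):  # in the real case this loop would be distributed over processes
    fill_row(i)

print(np.ctypeslib.as_array(shared_array))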


@fzeiser (Collaborator Author) commented Jan 9, 2020

But wait a minute, I just recalled that MultiNest has built-in MPI support. I'll have to check that out again. It would be better to use that than to build multiprocessing around the MultiNest code.

@fzeiser (Collaborator Author) commented Jan 9, 2020

Waiting to see how easy it is to install mpi4py and MultiNest with MPI support. On my machine the gcc compiler is too old to build mpi4py. I could upgrade it, but I thought we might first see how well it works. I hope @vetlewi can install it on his machine.

@fzeiser (Collaborator Author) commented Jan 9, 2020

It seems to be working smoothly also with the current implementation, using pathos.multiprocessing around the "whole" loop. I'll upload the notebook once the run is finished. There is a small bug though: the loggers for the ensembleNormalize class don't work any longer. This probably has something to do with the fact that I wrote to the same logger from different processes, which I shouldn't. (Weird note: up to this very last run I was sure that the ensemblenorm_seq logger showed up but not the ensemblenorm_sim logger. It makes more sense that they are both gone.)

@fzeiser (Collaborator Author) commented Jan 9, 2020

I see that they are not "gone" but written out to the notebook in the wrong place. They now show up at output line [19], i.e. in the cell starting with `logger = om.introspection.get_logger('ensemble', 'INFO')`, instead of at lines [37] and [39]. This is probably because that is the first time the logger is used for parallel processing.

@fzeiser (Collaborator Author) commented Jan 9, 2020

Found the issue: I was still using the same pool. I have to call pool.clear in the code after execution, then it works. See also:
uqfoundation/pathos#111
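Roughly like this (a sketch of the pattern, not the exact ompy code): pathos caches its pools, so without clearing, the next map call silently reuses the old worker processes and their logging state.

from pathos.multiprocessing import ProcessingPool

def run_parallel(func, jobs, nodes=4):
    pool = ProcessingPool(nodes=nodes)
    try:
        return pool.map(func, jobs)
    finally:
        pool.close()
        pool.join()
        pool.clear()  # drop the cached pool so a fresh one is created next time

results = run_parallel(lambda x: x * x, range(10))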

@fzeiser (Collaborator Author) commented Jan 9, 2020

There was something else I did not think about carefully enough: when parallelizing, one has to take care that the random state of numpy is not just copied, but that each process gets a differently spawned/seeded generator. The code is slightly more complex now, but I think it's worth the change.
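The idea is roughly this (names are made up for illustration, not the ompy implementation): draw one seed per ensemble member in the parent process and build an independent generator inside each worker, instead of letting every worker inherit a copy of the same global numpy state.

import numpy as np
from pathos.multiprocessing import ProcessingPool

def generate_member(task):
    step, seed = task
    rng = np.random.RandomState(seed)          # independent generator per member
    return rng.poisson(lam=10.0, size=(3, 3))  # stand-in for the perturbed matrix

number = 8
seeds = np.random.randint(0, 2**31 - 1, size=number)  # drawn once in the parent

pool = ProcessingPool(nodes=4)
ensemble = pool.map(generate_member, list(zip(range(number), seeds)))
pool.close()
pool.join()
pool.clear()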

@ErlendLima (Contributor) commented

Shared memory is desirable only if 1) the memory is limited or 2) copying the memory is a bottleneck. Otherwise it only adds unnecessary complexity to the code. For now your solution is sufficient. I don't quite see how pickling is a problem when using the standard multiprocessing module.

@fzeiser force-pushed the dev/parallel branch 2 times, most recently from 9ff79c1 to 97cd212, January 16, 2020 18:28
@fzeiser (Collaborator Author) commented Jan 16, 2020

Rebased on master, ready to commit.

I don't quite see how pickling is a problem when using the standard multiprocessing module.

See https://stackoverflow.com/questions/19984152/what-can-multiprocessing-and-dill-do-together
and
https://stackoverflow.com/questions/1816958/cant-pickle-type-instancemethod-when-using-multiprocessing-pool-map

Shared memory is desirable only if 1) the memory is limited or 2) copying the memory is a bottleneck. Otherwise it only adds unnecessary complexity to the code. For now your solution is sufficient.

I totally agree, but let's stick to this solution for now.

@fzeiser (Collaborator Author) commented Jan 16, 2020

Just remembered to set the default number of CPUs to cpu_count - 1 (if cpu_count > 1), as one CPU will usually be occupied with the system anyhow. Uploading the changes as soon as the run has finished.
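Something along these lines (a sketch of the default, not necessarily the exact expression used):

import multiprocessing

cpus = multiprocessing.cpu_count()
nprocesses = max(1, cpus - 1)  # leave one core for the system, but never go below 1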

fzeiser added 2 commits January 16, 2020 20:08
- Used pathos multiprocessing because of a problem with pickling when using the "normal" multiprocessing module.
- Added parallelization for the ensemble class; fixes #67
- Fixed the logger issue for multiprocessing (closed pools after usage)
- Fixed random seeds for each thread
@fzeiser merged commit 2e95d9e into master on Jan 16, 2020
@fzeiser deleted the dev/parallel branch on January 17, 2020 13:34
@vetlewi (Collaborator) commented Jan 18, 2020

Just remembered to set the default number of CPUs to cpu_count - 1 (if cpu_count > 1), as one CPU will usually be occupied with the system anyhow. Uploading the changes as soon as the run has finished.

Don’t think you need to worry about that. The OS scheduler will take care of it.

fzeiser pushed a commit that referenced this pull request Jan 20, 2020
After the parallelization in #94, we had an issue if a matrix had negative Ex entries. They were automatically cut in the fg method. This led to different sizes of the raw, unfolded and firstgen ensembles, which created a mess later.

Instead, the first-generation method now throws an error if the input matrix has negative excitation energies. Additionally, there is an assert statement for `step` in ensemble.
@fzeiser (Collaborator Author) commented Mar 10, 2020

I finally got around yesterday to checking the MPI support of MultiNest and whether that would be a better way to parallelize. It turns out that parallelizing our current problem with ~400 live points with MPI makes the calculations slower or gives similar results.

I tested this with PyMultiNest's minimal example first. MPI made it slower. This is because the evaluation of each single likelihood was "too quick", such that it didn't pay off to distribute the calculations. When I instead increased the time each likelihood calculation takes ("stupid" mode, inserting a sleep()), at some point it became worthwhile to distribute the calculations with MPI.

My short summary is that with our current likelihood, it takes about 2 ms per evaluation. With 400 live points, I got a speedup of ~2 when using MPI with 2 (and equivalently for 3) cores. If I instead parallelize the outer loop, i.e. run each realization simultaneously, I get a linear speedup until n_cores = n_realizations. So for 50 realizations, and not using more cores :), I'd get a much higher speedup by parallelizing the way we currently do it.

A side note:
If I increase to 1000 live points, I suddenly see a slight speedup:

  • 1 core: 32 s
  • 2 cores, MPI: 19 s
  • 3 cores, MPI: 17 s

The only "disadvantage" of running 50 realizations on 50 different cores is that I (have to) wait for the slowest realization to end before I continue.

@vetlewi (Collaborator) commented Mar 11, 2020

I finally got around yesterday to checking the MPI support of MultiNest and whether that would be a better way to parallelize. It turns out that parallelizing our current problem with ~400 live points with MPI makes the calculations slower or gives similar results. [...]

Sounds like MPI probably requires quite a lot of time to initialize

@fzeiser (Collaborator Author) commented Mar 11, 2020

It could be either the time to initialize or the time to communicate. I'm not quite sure, but I guess the processes have to communicate with each other every time the n live points have been updated once. So you get quite some communication, which might slow things down, too.

@fzeiser (Collaborator Author) commented Mar 11, 2020

I didn't show it here, but I think I tried the same game for something which had a runtime of ~6 min, and it seemed to have a slightly better runtime without MPI. Again, this was for a case where the likelihood was very(!) quick to calculate.

@vetlewi (Collaborator) commented Mar 11, 2020

I didn't show it here, but I think I tried the same game for something which had a runtime of ~6 min, and it seemed to have a slightly better runtime without MPI. Again, this was for a case where the likelihood was very(!) quick to calculate.

If you are using the gcloud VM, it might not be optimized for MPI workloads but rather for many independent threads.

@fzeiser (Collaborator Author) commented Mar 11, 2020

Good point! Still, for now I don't see any reason to switch from multiprocessing to MPI.

@fzeiser (Collaborator Author) commented Mar 11, 2020

I used the VM to test, as I was never able to install mpi4py properly on my own machine.
