Parallelize working on Ensemble normalization #94
Conversation
Check out this pull request on ReviewNB. You'll be able to see Jupyter notebook diffs and discuss changes. Powered by ReviewNB.
But wait a minute, I just recalled that MultiNest has built-in MPI support. I'll have to check that out again. It would be better to use that than to build multiprocessing around the MultiNest code.
Waiting to see how easy it is to install.
It also seems to be working smoothly with the current implementation, using
see that they are not "gone" but written out to the notebook in the wrong place. They are now written at output
Found the issue: I was still using the same pool. I have to run
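(The comment is cut off here, but given the commit note about closing pools after usage, a minimal sketch of that pattern with pathos might look like the following; the stage functions are illustrative, not the PR's actual code:)

```python
from pathos.multiprocessing import ProcessingPool

pool = ProcessingPool(nodes=4)
first = pool.map(lambda x: x ** 2, range(8))

# Close the pool once this stage is done; a lingering pool keeps
# stale worker processes (and their logging state) around.
pool.close()
pool.join()
pool.clear()  # pathos caches pools, so drop the cached one explicitly

# Start a fresh pool for the next stage instead of reusing the old one.
pool = ProcessingPool(nodes=4)
second = pool.map(lambda x: x + 1, range(8))
pool.close()
pool.join()
```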
There was something else I did not think about carefully enough: when parallelizing, one has to take care that numpy's random state is not just copied, but that each process gets its own freshly spawned/seeded generator. The code is slightly more complex now, but I think it's worth the change.
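A minimal sketch of one way to do this with numpy's `SeedSequence` (the worker function and pool setup are illustrative, not necessarily the PR's exact code):

```python
import numpy as np
from pathos.multiprocessing import ProcessingPool

def simulate(seed_seq):
    # Each worker builds its own Generator from an independently
    # spawned SeedSequence, instead of inheriting a copied state.
    rng = np.random.default_rng(seed_seq)
    return rng.poisson(lam=10.0, size=3)

n_workers = 4
child_seeds = np.random.SeedSequence(65429).spawn(n_workers)

pool = ProcessingPool(nodes=n_workers)
results = pool.map(simulate, child_seeds)
pool.close()
pool.join()
```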
Shared memory is desirable only if 1) the memory is limited or 2) copying the memory is a bottleneck. Otherwise it only adds unnecessary complexity to the code. For now your solution is sufficient. I don't quite see how pickling is a problem when using the standard
Force-pushed from 9ff79c1 to 97cd212.
Rebased on master, ready to commit.
See https://stackoverflow.com/questions/19984152/what-can-multiprocessing-and-dill-do-together
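The gist of the linked answer, sketched: the stdlib `multiprocessing` serializes the callable with `pickle`, which cannot handle lambdas or closures, while pathos serializes with `dill`:

```python
from pathos.multiprocessing import ProcessingPool

square = lambda x: x * x

# With the stdlib this fails, since pickle cannot serialize lambdas:
#   import multiprocessing
#   with multiprocessing.Pool(2) as p:
#       p.map(square, range(5))   # -> PicklingError
# pathos uses dill instead, which handles lambdas and closures:
pool = ProcessingPool(nodes=2)
print(pool.map(square, range(5)))  # [0, 1, 4, 9, 16]
pool.close()
pool.join()
```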
I totally agree, but let's stick to this solution for now.
Just remembered to set the default number of CPUs to
- Used pathos multiprocessing because of problems with pickling when using the "normal" multiprocessing module
- Added parallelization for the ensemble class; fixes #67
- Fixed logger issue for multiprocessing (closed pools after usage)
- Fixed random seeds for each thread
Don’t think you need to worry about that. The OS scheduler will take care of it.
After parallelization in #94, we had an issue if a matrix had negative Ex entries. They were automatically cut in the fg method, which leads to different sizes of the raw, unfolded, and firstgen ensembles -- which created a mess later. Instead, the first-generation method now throws an error if the input matrix has negative excitation energies. Additionally, there is an assert statement for the `step` function in ensemble.
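A minimal sketch of such a guard (the function and argument names are hypothetical, not ompy's actual signature):

```python
import numpy as np

def first_generation(matrix: np.ndarray, Ex: np.ndarray) -> np.ndarray:
    """Hypothetical first-generation wrapper; Ex is the excitation-energy axis."""
    if np.any(np.asarray(Ex) < 0):
        # Refuse to run instead of silently cutting negative-Ex rows,
        # which would give raw/unfolded/firstgen ensembles different shapes.
        raise ValueError("Input matrix has negative excitation energies; "
                         "cut them before running the first-generation method.")
    ...
```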
I finally got around yesterday to checking the MPI support of MultiNest. I tested this with PyMultiNest's minimal example first. MPI made it slower. This is because the evaluation of each single likelihood was "too quick", so it didn't pay off to distribute the calculations. When I instead increased the time each likelihood calculation takes ("stupid" mode, inserting a sleep()), at some point it became worth distributing the calculations with MPI.

My short summary: with our current likelihood, it takes about 2 ms/evaluation. With 400 livepoints, I got a speedup of ~2 when using MPI with 2 (and equivalently for 3) cores. If I instead parallelize the outer loop, i.e. run each realization simultaneously, I get a linear speedup until n_cores = n_realizations. So for 50 realizations, and not using more cores :), I get a much higher speedup by parallelizing the way we currently do.

A side note: the only "disadvantage" of running 50 realizations on 50 different cores is that I (have to) wait for the slowest realization to finish before I continue.
Sounds like MPI probably requires quite a lot of time to initialize
It could be either the time to initialize or the time to communicate. I'm not quite sure, but I guess the processes have to communicate with each other every time the n livepoints have been updated once. So you get quite some communication, which might slow things down, too.
I didn't show it here, but I think I tried the same game for something with a runtime of ~6 min -- and it seemed to have a slightly better runtime without MPI. Again, this was for a case where the likelihood was very(!) quick to calculate.
If you are using the gcloud VM, it might not be optimized for MPI workloads but for many independent threads.
Good point! Still, for now I don't see any reason to switch from
I used the VM to test, as I was never able to install
I also had a look at #67, the parallelization of the ensemble creation. However, to do this we would have to rewrite the function somewhat more. Currently we want to have access to several arrays from all threads:

ompy/ompy/ensemble.py, lines 204 to 221 in f4ccc09

One option would be to use shared_memory (see the good example on numpy shared arrays in https://jonasteuwen.github.io/numpy/python/multiprocessing/2017/01/07/multiprocessing-numpy-array.html) to share raw_ensemble, unfolded_ensemble and firstgen_ensemble. The alternative is to have each worker return a tuple (raw, unf, fg), which we would cast into numpy arrays in the next step.

Which version do you think is preferable? The numpy shared array thing might be faster, but I think it turns out to be less readable, especially for someone who doesn't know this feature too well. We'd get three blocks like this (ok, maybe it's not so terrible ;P, but it was easier to read before):
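(For concreteness, roughly what one of those three blocks would look like, following the pattern from the linked blog post; shapes are illustrative, and it assumes a fork-based start method so children inherit the buffer:)

```python
import multiprocessing
import numpy as np

n_members, nx, ny = 10, 100, 100

# Shared buffer for the raw ensemble; the unfolded and firstgen
# ensembles would each need an identical block.
shared_raw = multiprocessing.Array("d", n_members * nx * ny, lock=False)
raw_ensemble = np.frombuffer(shared_raw, dtype=np.float64)
raw_ensemble = raw_ensemble.reshape(n_members, nx, ny)

def generate_member(i):
    # Each worker writes only into its own slice, so no lock is needed.
    raw_ensemble[i] = np.random.poisson(lam=1.0, size=(nx, ny))
```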