
Parallelize working on Ensemble normalization #94

Merged: fzeiser merged 3 commits into master from dev/parallel on Jan 16, 2020

Conversation

@fzeiser (Collaborator) commented Jan 9, 2020

Used pathos multiprocessing because of a problem with pickling when using the "normal" multiprocessing module.
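For illustration, a minimal sketch (not the ompy code itself) of the kind of call that the standard multiprocessing pickler rejects but pathos handles, since pathos serializes with dill:

from pathos.multiprocessing import ProcessingPool

# A lambda (or a locally defined function, or a method of a class defined in a
# notebook) cannot be serialized by the standard pickle module, so
# multiprocessing.Pool.map would fail here; pathos uses dill, which can handle it.
square = lambda x: x * x

pool = ProcessingPool(nodes=2)
results = pool.map(square, range(8))
pool.close()
pool.join()
print(results)  # [0, 1, 4, 9, 16, 25, 36, 49]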

I also had a look at #67, the parallelization of the ensemble creation. However, to do this we would have to rewrite the function somewhat more, since currently we want to have access to several arrays from all workers:

ompy/ompy/ensemble.py, lines 204 to 221 in f4ccc09:

raw_ensemble = np.zeros((number, *self.raw.shape))
unfolded_ensemble = np.zeros_like(raw_ensemble)
firstgen_ensemble = np.zeros_like(raw_ensemble)
for step in tqdm(range(number)):
    LOG.info(f"Generating {step}")
    if self.bg is not None:
        prompt_w_bg = self.generate_perturbed(step, method,
                                              state="prompt+bg")
        bg = self.generate_perturbed(step, method, state="bg")
        raw = self.subtract_bg(step, prompt_w_bg, bg)
    else:
        raw = self.generate_perturbed(step, method, state="raw")
    unfolded = self.unfold(step, raw)
    firstgen = self.first_generation(step, unfolded)
    raw_ensemble[step, :, :] = raw.values
    unfolded_ensemble[step, :, :] = unfolded.values

Which version do you think is preferable? The numpy shared-array approach might be faster, but I think it turns out to be less readable, especially for someone who doesn't know this feature too well. We'd get three blocks like this (ok, maybe it's not so terrible ;P, but it was easier to read before):

result = np.ctypeslib.as_ctypes(np.zeros((size, size)))
shared_array = sharedctypes.RawArray(result._type_, result)
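For context, a rough sketch of how a worker would then write into the shared buffer (illustrative only, using a made-up fill_row helper):

import numpy as np
from multiprocessing import sharedctypes

size = 4
result = np.ctypeslib.as_ctypes(np.zeros((size, size)))
shared_array = sharedctypes.RawArray(result._type_, result)

def fill_row(i):
    # Re-wrap the shared ctypes buffer as a numpy view inside the worker
    # and write the row in place; nothing needs to be returned.
    view = np.ctypeslib.as_array(shared_array)
    view[i, :] = i

for i in range(size):  # in the real case this loop would be distributed over processes
    fill_row(i)

print(np.ctypeslib.as_array(shared_array))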


@fzeiser (Collaborator Author) commented Jan 9, 2020

But wait a minute, I just recalled that MultiNest has built-in MPI support. I'll have to check that out again. It would be better to use that than to build multiprocessing around the MultiNest code.

@fzeiser (Collaborator Author) commented Jan 9, 2020

Waiting to see how easy it is to install mpi4py and MultiNest with MPI support. On my machine the gcc compiler is too old to build mpi4py. I could upgrade it, but I thought we might first see how well it works. I hope @vetlewi can install it on his machine.

@fzeiser (Collaborator Author) commented Jan 9, 2020

It seems to be working smoothly also with the current implementation, using pathos.multiprocessing around the "whole" loop. I'll upload the notebook once the run is finished. There is a small bug though: the loggers for the ensembleNormalize class don't work any longer. This probably has something to do with the fact that I wrote to the same logger from different processes, which I shouldn't. (Weird note: up to this very last run I was sure that the ensemblenorm_seq logger showed up but not the ensemblenorm_sim logger. It makes more sense that they are both gone.)

@fzeiser (Collaborator Author) commented Jan 9, 2020

I see that they are not "gone" but written out to the notebook in the wrong place. They now show up at output line [19], i.e. in the cell starting with `logger = om.introspection.get_logger('ensemble', 'INFO')`, instead of at lines [37] and [39]. This is probably because that is the first time the logger is used for parallel processing.

@fzeiser (Collaborator Author) commented Jan 9, 2020

Found the issue: I was still using the same pool. I have to call pool.clear in the code after execution, then it works. See also:
uqfoundation/pathos#111
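Roughly like this (a sketch of the pattern, not the exact ompy code): pathos caches its pools, so without clearing, the next map call silently reuses the old worker processes and their logging state.

from pathos.multiprocessing import ProcessingPool

def run_parallel(func, jobs, nodes=4):
    pool = ProcessingPool(nodes=nodes)
    try:
        return pool.map(func, jobs)
    finally:
        pool.close()
        pool.join()
        pool.clear()  # drop the cached pool so a fresh one is created next time

results = run_parallel(lambda x: x * x, range(10))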

@fzeiser (Collaborator Author) commented Jan 9, 2020

There was something else I did not think about carefully enough: when parallelizing, one has to take care that the random state of numpy is not just copied, but that each process gets a differently spawned/seeded generator. The code is slightly more complex now, but I think it's worth the change.
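The idea is roughly this (names are made up for illustration, not the ompy implementation): draw one seed per ensemble member in the parent process and build an independent generator inside each worker, instead of letting every worker inherit a copy of the same global numpy state.

import numpy as np
from pathos.multiprocessing import ProcessingPool

def generate_member(task):
    step, seed = task
    rng = np.random.RandomState(seed)          # independent generator per member
    return rng.poisson(lam=10.0, size=(3, 3))  # stand-in for the perturbed matrix

number = 8
seeds = np.random.randint(0, 2**31 - 1, size=number)  # drawn once in the parent

pool = ProcessingPool(nodes=4)
ensemble = pool.map(generate_member, list(zip(range(number), seeds)))
pool.close()
pool.join()
pool.clear()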

@ErlendLima (Contributor) commented

Shared memory is desirable only if 1) the memory is limited or 2) copying the memory is a bottleneck. Otherwise it only adds unnecessary complexity to the code. For now your solution is sufficient. I don't quite see how pickling is a problem when using the standard multiprocessing module.

@fzeiser force-pushed the dev/parallel branch 2 times, most recently from 9ff79c1 to 97cd212, January 16, 2020 18:28
@fzeiser (Collaborator Author) commented Jan 16, 2020

Rebased on master, ready to commit.

I don't quite see how pickling is a problem when using the standard multiprocessing module.

See https://stackoverflow.com/questions/19984152/what-can-multiprocessing-and-dill-do-together
and
https://stackoverflow.com/questions/1816958/cant-pickle-type-instancemethod-when-using-multiprocessing-pool-map

Shared memory is desirable only if 1) the memory is limited or 2) copying the memory is a bottleneck. Otherwise it only adds unnecessary complexity to the code. For now your solution is sufficient.

I totally agree, but let's stick to this solution for now.

@fzeiser (Collaborator Author) commented Jan 16, 2020

Just remembered to set the default number of CPUs to cpu_count - 1 (if cpu_count > 1), as one CPU will usually be occupied with the system anyhow. Uploading the changes as soon as the run has finished.
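Something along these lines (a sketch of the default, not necessarily the exact expression used):

import multiprocessing

cpus = multiprocessing.cpu_count()
nprocesses = max(1, cpus - 1)  # leave one core for the system, but never go below 1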

fzeiser added 2 commits January 16, 2020 20:08
- Used pathos multiprocessing because of a problem with pickling when using the "normal" multiprocessing module.
- Added parallelization for the ensemble class; fixes #67
- Fixed the logger issue for multiprocessing (closed pools after usage)
- Fixed random seeds for each thread
@fzeiser merged commit 2e95d9e into master on Jan 16, 2020
@fzeiser deleted the dev/parallel branch on January 17, 2020 13:34
@vetlewi (Collaborator) commented Jan 18, 2020

Just remembered to set the default number of CPUs to cpu_count - 1 (if cpu_count > 1), as one CPU will usually be occupied with the system anyhow. Uploading the changes as soon as the run has finished.

Don’t think you need to worry about that. The OS scheduler will take care of it.

fzeiser pushed a commit that referenced this pull request Jan 20, 2020
After the parallelization in #94, we had an issue if a matrix had negative Ex entries. They were automatically cut in the fg method. This led to different sizes of the raw, unfolded and firstgen ensembles, which created a mess later.

Instead, the first-generation method now throws an error if the input matrix has negative excitation energies. Additionally, there is an assert statement for `step` in ensemble.
@fzeiser (Collaborator Author) commented Mar 10, 2020

I finally got around yesterday to checking the MPI support of MultiNest and whether that would be a better way to parallelize. It turns out that parallelizing our current problem with ~400 live points with MPI makes the calculations slower or gives similar results.

I tested this with PyMultiNest's minimal example first. MPI made it slower. This is because the evaluation of each single likelihood was "too quick", such that it didn't pay off to distribute the calculations. When I instead increased the time each likelihood calculation takes ("stupid" mode, inserting a sleep()), at some point it became worthwhile to distribute the calculations with MPI.

My short summary is that with our current likelihood, it takes about 2 ms per evaluation. With 400 live points, I got a speedup of ~2 when using MPI with 2 (and equivalently for 3) cores. If I instead parallelize the outer loop, i.e. run each realization simultaneously, I get a linear speedup until n_cores = n_realizations. So for 50 realizations, and not using more cores :), I'd get a much higher speedup by parallelizing the way we currently do it.

A side note:
If I increase to 1000 live points, I suddenly see a slight speedup:

  • 1 core: 32 s
  • 2 cores, MPI: 19 s
  • 3 cores, MPI: 17 s

The only "disadvantage" of running 50 realizations on 50 different cores is that I (have to) wait for the slowest realization to end before I continue.

@vetlewi (Collaborator) commented Mar 11, 2020

I finally got around yesterday to checking the MPI support of MultiNest and whether that would be a better way to parallelize. It turns out that parallelizing our current problem with ~400 live points with MPI makes the calculations slower or gives similar results. [...]

Sounds like MPI probably requires quite a lot of time to initialize

@fzeiser (Collaborator Author) commented Mar 11, 2020

It could be either the time to initialize or the time to communicate. I'm not quite sure, but I guess the processes have to communicate with each other every time the n live points have been updated once. So you get quite some communication, which might slow things down, too.

@fzeiser (Collaborator Author) commented Mar 11, 2020

I didn't show it here, but I think I tried the same game for something which had a runtime of ~6 min, and it seemed to have a slightly better runtime without MPI. Again, this was for a case where the likelihood was very(!) quick to calculate.

@vetlewi (Collaborator) commented Mar 11, 2020

I didn't show it here, but I think I tried the same game for something which had a runtime of ~6 min, and it seemed to have a slightly better runtime without MPI. Again, this was for a case where the likelihood was very(!) quick to calculate.

If you are using the gcloud VM, it might not be optimized for MPI workloads but rather for many independent threads.

@fzeiser (Collaborator Author) commented Mar 11, 2020

Good point! Still, for now I don't see any reason to switch from multiprocessing to MPI.

@fzeiser (Collaborator Author) commented Mar 11, 2020

I used the VM to test, as I was never able to install mpi4py properly on my own machine.
