# Synchronisation using shared-memory formalism in moment kinetics (#140)
I hadn't considered this kind of case when setting up the shared-memory parallelism. The hope when I was writing it was to parallelise just the spatial part of the loop and leave velocity space serial. Clearly that isn't what we want to do now. I think there is a way to implement this with the current code, and a couple of options for optimising either for communication speed (at least a little bit) or for memory usage.

**Current implementation**

The straightforward option would be to have buffer arrays for everything necessary (as you noted), end the spatial loops between each segment, and change the loop type so that the whole shared-memory block synchronizes. For the example you gave, that would be something like

```julia
# recast S into an array with a single, flattened dimension for velocity
# space, indexed by ic; this can be done by creating a reshaped view
# (reshape shares memory with S), to save memory
S1D = reshape(S, vperp.n*vpa.n, z.n, r.n, nspecies)

begin_s_r_z_vperp_vpa_region()
@loop_s_r_z_vperp_vpa is ir iz ivperp ivpa begin
    # do something parallelisable,
    # e.g., set sources for an elliptic solve
    S[ivpa,ivperp,iz,ir,is] = ...
end

# changing region type synchronises the whole shared-memory block;
# then do the matrix solve in the compound index c
begin_s_r_z_region()
@loop_s_r_z is ir iz begin
    @views field[:,iz,ir,is] = lu_obj \ S1D[:,iz,ir,is]
end

# get back to parallelising over ivpa, ivperp
begin_s_r_z_vperp_vpa_region()
@loop_s_r_z_vperp_vpa is ir iz ivperp ivpa begin
    result[ivpa,ivperp,iz,ir,is] = ...
end
```

As you noted, the trouble with this approach is that all the buffers have to be the full pdf size.

**Optimized approach**

In principle we don't need to use that much memory, and we can optimise the communication time by only synchronizing velocity-space blocks (which should be faster because the blocks are smaller, I guess?). Then the buffers would only need to be velocity-space sized.
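(A minimal standalone sketch, assuming MPI.jl, of how velocity-space sub-block communicators could be split off so that synchronisation only involves the smaller sub-block; `nprocs_per_vspace_block`, and the use of `MPI.COMM_WORLD` as a stand-in for the shared-memory block communicator, are illustrative assumptions, not moment_kinetics code:)

```julia
using MPI

MPI.Init()
block_comm = MPI.COMM_WORLD   # stand-in for the shared-memory block communicator
rank = MPI.Comm_rank(block_comm)

# group the processes that work on the same spatial points into one
# velocity-space sub-block
nprocs_per_vspace_block = 4
color = rank ÷ nprocs_per_vspace_block
vspace_comm = MPI.Comm_split(block_comm, color, rank)

# a barrier over the sub-block involves fewer processes than one over the
# whole shared-memory block, so it should be cheaper
MPI.Barrier(vspace_comm)

MPI.Finalize()
```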
**Optimize for memory usage**

The way to optimize for minimal memory usage would be to not do shared-memory parallelism over space. Then we'd only need velocity-space-sized buffers. The example could look like

```julia
# reshaped view of S with a single, flattened velocity-space dimension
# indexed by ic (shares memory with S, so costs no extra storage)
S1D = reshape(S, vperp.n*vpa.n, z.n, r.n, nspecies)

begin_vperp_vpa_region()
@loop_s_r_z is ir iz begin
    # this spatial loop is not split up - every process in the
    # shared-memory block passes through every spatial point
    @loop_vperp_vpa ivperp ivpa begin
        # do something parallelisable,
        # e.g., set sources for an elliptic solve
        S[ivpa,ivperp,iz,ir,is] = ...
    end
    # synchronise with the above and do the matrix solve
    # in the compound index c
    _block_synchronize()
    @views field[:,iz,ir,is] = lu_obj \ S1D[:,iz,ir,is]
    # get back to ivpa, ivperp
    _block_synchronize()
    @loop_vperp_vpa ivperp ivpa begin
        result[ivpa,ivperp,iz,ir,is] = ...
    end
end
```

**Todo?**

If the 'optimized approach' is the way to go, I can start having a look at the looping code to add a communicator over vspace sub-blocks and the [...]

---
Comment 1:
Does the reshaped view index the array in the same way that I have chosen to when creating the assembled matrix problems? This detail is the reason why I have done everything manually -- I don't want to assume that my code and the Julia convention will remain consistent. For people reading the code, having the base-level functions explicitly available seems nicer for understanding what is happening. If we can use one of your MPI improvements, there is no reason for the dummy arrays to have the size of the pdf, so saving memory here shouldn't be important.

Comment 2: I like the ideas of both the "Optimized approach" and the "Optimize for memory usage" approach. I would rather have the former, but I don't want to cause software development that isn't easy to execute for speculative reasons. Since C[F,F] is very slow to evaluate, we may not be able to use the operator with shared memory alone anyway, in which case these and more substantial changes will be needed. My suggestion would be that I try the latter option with the existing code framework, and then comment on the execution time for a simulation. If, following that experiment, we think that the optimised code is a must, we can look into that as and when. If, however, you want to think about the optimized approach anyway now that I have raised the idea, I am happy for you to PR into the branch https://github.com/mabarnes/moment_kinetics/tree/radial-vperp-standard-DKE-Julia-1.7.2-mpi.
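(As a quick check of Comment 1's question about the reshape convention: Julia flattens in column-major order, and `reshape` of an `Array` returns a view sharing the parent's memory, so the compound index is `ic = ivpa + (ivperp - 1)*vpa.n`. A minimal standalone sketch, with made-up sizes:)

```julia
# verify how reshape flattens the two velocity-space dimensions:
# Julia arrays are column-major, so the first index (ivpa) varies fastest
nvpa, nvperp, nz, nr, nspecies = 3, 4, 2, 2, 1
S = rand(nvpa, nvperp, nz, nr, nspecies)
S1D = reshape(S, nvpa*nvperp, nz, nr, nspecies)

for ivperp in 1:nvperp, ivpa in 1:nvpa
    ic = ivpa + (ivperp - 1)*nvpa   # compound velocity-space index
    @assert S1D[ic, 1, 1, 1] == S[ivpa, ivperp, 1, 1, 1]
end

# the reshaped view shares memory, so writing through S1D updates S
S1D[1, 1, 1, 1] = 0.0
@assert S[1, 1, 1, 1] == 0.0
```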
---

A further thought: in calculating the boundary data, I also used the [...]. For example, if I have to use [...], what are the consequences for the number of cores used in the loops?

---
It's not safe to loop over only `vpa` or only `vperp` within a [...]. It should be fine to call [...].

---
Yes, I think so.

---
My attempt to implement the weak-form Fokker-Planck operator in the time-evolving code is in this commit: 3f6e4db. @johnomotani, comments on whether or not I correctly followed your recommendations would be helpful! I tried to keep the [...]

---
Update for future reference: with help from @johnomotani, a shared-memory parallelised version of the C[F,F] collision operator exists on commit c0cc4a2. The parallelisation is in [...].

EDIT: This observation can be confirmed in practice by running a 1D2V simulation and using distributed-memory MPI for the [...].

---
Original issue description:

When generalising the Fokker-Planck collision operator test script, I need to convert a series of code blocks from supporting parallelism in `vpa` & `vperp` to covering the whole of `z`, `r`, `s`, `vpa` and `vperp`. This is not trivial because the operations are not wholly embarrassingly parallel, but there are steps that are trivially parallelised, which we should parallelise to get an optimally fast version of the operator.

For example, in the test script there is a block which has the following form.
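(A hedged sketch of the kind of block meant here, assuming a chain of elliptic solves parallelised only over `vperp` and `vpa`; the names `S1`, `S2`, `phi1`, `phi2`, `lu_obj_1` and `lu_obj_2` are hypothetical:)

```julia
begin_vperp_vpa_region()
@loop_vperp_vpa ivperp ivpa begin
    # assemble the source for the first elliptic solve
    S1[ivpa,ivperp] = ...
end
# synchronise, then solve serially in the compound velocity-space index
_block_synchronize()
phi1 = lu_obj_1 \ vec(S1)
_block_synchronize()
@loop_vperp_vpa ivperp ivpa begin
    # use phi1 to assemble the source for the next solve
    S2[ivpa,ivperp] = ...
end
_block_synchronize()
phi2 = lu_obj_2 \ vec(S2)
```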
The problem with adding `z`, `r` and `s` is that we would ideally place the whole of the above inside a single loop. None of the elliptic solves above are connected in `z`, `r` and `s`, and so they should be parallelised over, i.e., we would like something of the following form.
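(A sketch of the desired pattern, again with hypothetical names; the point is the region changes inside the spatial loop:)

```julia
@loop_s_r_z is ir iz begin
    # desired: switch to velocity-space parallelism inside the spatial loop
    begin_vperp_vpa_region()
    @loop_vperp_vpa ivperp ivpa begin
        S[ivpa,ivperp,iz,ir,is] = ...
    end
    # desired: synchronise, then solve serially at this spatial point
    begin_serial_region()
    @views field[:,iz,ir,is] = lu_obj \ S1D[:,iz,ir,is]
end
```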
However, in the current shared-memory formalism, it does not seem possible to call a `begin_*_region()` within a loop. This seems to mean that we cannot synchronise the array data whilst inside the `z`, `r`, `s` loop. Previously, I have avoided this difficulty by introducing buffer arrays to store the appropriate data. The downside is that one needs to store data of size `nz*nr*ns*nvpa*nvperp`. For the Fokker-Planck operator this leads to unacceptable memory usage, as we would need ~10 pdf-sized buffer arrays.

Would it be possible to consider an upgrade to the shared-memory framework to permit synchronisation over a subset of the shared-memory block? In the example above, we would want to synchronise all of `vpa` and `vperp` at fixed `iz`, `ir`, `is`. I am sure that this is possible with normal distributed-memory MPI with an `MPI.Allreduce`. Unfortunately, the present distributed-memory MPI only parallelises elements, and not points, meaning that we cannot have a single spatial point on the shared-memory process.

@johnomotani Discussion is appreciated!
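(For reference, a minimal standalone MPI.jl sketch of that kind of sub-block reduction; the process layout — `nprocs_per_point` processes per spatial point — is an assumption for illustration, not the moment_kinetics layout:)

```julia
using MPI

MPI.Init()
comm = MPI.COMM_WORLD
rank = MPI.Comm_rank(comm)

# assumed layout: processes sharing a spatial point form one sub-communicator
nprocs_per_point = 2
spatial_point = rank ÷ nprocs_per_point      # "color" selecting the sub-block
subcomm = MPI.Comm_split(comm, spatial_point, rank)

# each process fills its chunk of the velocity-space source, zeros elsewhere
nvelocity = 8
S = zeros(nvelocity)
subrank = MPI.Comm_rank(subcomm)
chunk = nvelocity ÷ nprocs_per_point
S[subrank*chunk+1:(subrank+1)*chunk] .= 1.0

# combine the partial sources so that every process in the sub-block holds
# the full velocity-space array for its spatial point
S = MPI.Allreduce(S, +, subcomm)

MPI.Finalize()
```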