
Synchronisation using shared-memory formalism in moment kinetics #140

Closed
mrhardman opened this issue Oct 30, 2023 · 7 comments

@mrhardman
Collaborator

mrhardman commented Oct 30, 2023

When generalising the Fokker-Planck collision operator test script, I need to convert a series of code blocks from supporting parallelism in vpa and vperp to covering the whole of z, r, s, vpa and vperp. This is not trivial because the operations are not wholly embarrassingly parallel, but there are steps that are trivially parallelisable, and we should parallelise these to get an optimally fast version of the operator.

For example, in the test script there is a block which has the following form

begin_vpa_vperp_region()
@loop_vperp_vpa ivperp ivpa begin
    # do something parallelisable,
    # e.g., set sources for an elliptic solve
    S[ivpa,ivperp] = ...
    # recast S into a 1D array S1D indexed by the compound index c
    S1D[ic] = ...
end

begin_serial_region()
# synchronise with the above call and do the matrix solve
# in the compound index c
field = lu_obj \ S1D

begin_vpa_vperp_region()
# get back to ivpa, ivperp
@loop_vperp_vpa ivperp ivpa begin
    result[ivpa,ivperp] = ...
end

The problem with adding z, r and s is that we would ideally place the whole of the above inside a single loop over those dimensions. None of the elliptic solves above are connected in z, r or s, so these dimensions should be parallelised over, i.e., we would like the following.

begin_z_r_s_region()
@loop_s_r_z is ir iz begin
 # do stuff in velocity space
end

However, in the current shared-memory formalism, it does not seem possible to call a begin_*_region() within a loop. This seems to mean that we cannot synchronise the array data whilst inside the z, r, s loop. Previously, I have avoided this difficulty by introducing buffer arrays to store the appropriate data. The downside is that one needs to store data of size nz*nr*ns*nvpa*nvperp. For the Fokker-Planck operator this leads to unacceptable memory usage, as we would need ~10 pdf-sized buffer arrays.
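
(For a rough sense of scale, here is a back-of-the-envelope estimate of that cost; the grid sizes below are purely illustrative and not taken from any particular simulation.)

# Hypothetical resolutions, chosen only to illustrate the scaling.
nz, nr, ns = 64, 32, 2
nvpa, nvperp = 128, 64
nbuffers = 10   # approximate number of pdf-sized work arrays required

bytes_per_buffer = nz * nr * ns * nvpa * nvperp * sizeof(Float64)
total_GiB = nbuffers * bytes_per_buffer / 2^30
println("≈ $(round(total_GiB, digits=1)) GiB of buffers")   # ≈ 2.5 GiB even at these modest resolutions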

Would it be possible to consider an upgrade to the shared-memory framework to permit synchronisation over a subset of the shared-memory block? In the example above, we would want to synchronise all of vpa and vperp at fixed iz ir is. I am sure that this is possible with normal distributed memory MPI with an MPI.Allreduce. Unfortunately, the present distributed-memory MPI only parallelises elements, and not points, meaning that we cannot have a single spatial point on the shared-memory process.

@johnomotani Discussion is appreciated!

@johnomotani
Collaborator

I hadn't considered this kind of case when setting up the shared-memory parallelism. The hope when I was writing it was to just parallelise the spatial part of the loop and leave velocity space in serial; clearly that isn't what we want to do now. I think there is a way to implement this with the current code, and a couple of options for optimising either communication speed (at least a little bit) or memory usage.

Current implementation

The straightforward option with the current implementation would be to have buffer arrays for everything necessary (as you noted), end the spatial loops between each segment, and change the loop type so that the whole shared-memory block synchronizes. For the example you gave, that would be something like

begin_s_r_z_vperp_vpa_region()
@loop_s_r_z_vperp_vpa is ir iz ivperp ivpa begin
    # do something parallelisable,
    # e.g., set sources for an elliptic solve
    S[ivpa,ivperp,iz,ir,is] = ...
end

# recast S into an array with a single, flattened dimension for velocity space,
# indexed by the compound index c, i.e. S1D[ic,iz,ir,is];
# this can be done by creating a reshaped view, to save memory
S1D = reshape(S, vperp.n*vpa.n, z.n, r.n, nspecies)

# the region change below synchronises with the loop above; then do the matrix solve
# in the compound index c
begin_s_r_z_region()
@loop_s_r_z is ir iz begin
    @views field[:,iz,ir,is] = lu_obj \ S1D[:,iz,ir,is]
end

begin_s_r_z_vperp_vpa_region()
# get back to ivpa, ivperp
@loop_s_r_z_vperp_vpa is ir iz ivperp ivpa begin
    result[ivpa,ivperp,iz,ir,is] = ...
end

As you noted, the trouble with this approach is that all the buffers have to be the full (nvpa,nvperp,nz,nr,nspecies).

Optimized approach

In principle we don't need to use that much memory, and we can optimise the communication time by only synchronizing velocity-space blocks (which should be faster because the blocks are smaller, I guess?). Then the buffers would only need to be of size (nvpa,nvperp,z.nrank,r.nrank,composition.nrank_ion), which hopefully should save an order of magnitude on the size of those buffers (note for myself - I don't think composition includes nrank and irank yet, but we can add that easily - we probably have to add separate variables for ions and neutrals). We'd need:

  1. A command like vspace_block_synchronize() that synchronizes just the velocity-space block, so it can be called within a spatially-parallelised loop. This should be OK to implement - for each region type we can create a communicator that groups the processes that cover the velocity dimensions for a given spatial point or set of spatial points; then vspace_block_synchronize() just calls MPI.Barrier() on that communicator (see the sketch after this list). The parallelisation is a 'direct product', so the shared-memory block trivially splits into subblocks in the way that we want for this.
  2. Some mechanism to allocate buffers of size (nvpa,nvperp,zproc,rproc,sproc) and get an appropriate (nvpa,nvperp)-sized slice of them. We'd need to add a way to get the number of sub-blocks (n_vspace_subblock), and the index of the 'current' subblock (i_vspace_subblock), to be able to do something like
    buffer = allocate_shared_float(vpa.n, vperp.n, n_vspace_subblock)
    and get the appropriate subblock like
    begin_s_r_z_vperp_vpa_region()
    local_buf = @view buffer[:,:,i_vspace_subblock]
    # ... do stuff with local_buf ...
    I worry a bit (although we should probably ignore this for now) that this will make the code significantly more confusing, because it is hard to do nicely in a general way. For now, we can probably just assume that the only place we want these buffers is in the collision operator, and there we are in a begin_s_r_z_vperp_vpa_region() type region. The difficulty is that each different region type would (potentially) have a different number of vspace subblocks (e.g. in a begin_s_z_vperp_vpa_region() region where r is not parallelised, the splitting of the parallelised dimensions would be different). [It might make sense to have some kind of get_**_buffer() function(s) that would grab a buffer from a global list if available, or allocate a new one if not. The immediately-obvious-to-me problem is how to 'release' a buffer back into the list when we are finished with it - an explicit function call would be possible, but it would be much nicer to do this automatically somehow...]
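
A minimal sketch of how the sub-block communicator from point 1 could be built with MPI.jl. Everything moment_kinetics-specific is stubbed out: comm_block stands in for the existing shared-memory block communicator, and spatial_subblock_index and vspace_block_synchronize() are hypothetical names, not existing API.

using MPI

MPI.Init()
comm_block = MPI.COMM_WORLD   # stand-in for the existing shared-memory block communicator
rank = MPI.Comm_rank(comm_block)

# Hypothetical: an integer identifying which set of spatial points this process covers.
# Processes covering the same spatial points (and differing only in their velocity-space
# sub-range) get the same colour, so Comm_split puts them in the same sub-communicator.
spatial_subblock_index = rank ÷ 4   # e.g. 4 processes per velocity-space sub-block

comm_vspace = MPI.Comm_split(comm_block, spatial_subblock_index, rank)

# The proposed vspace_block_synchronize() is then just a barrier on the sub-communicator,
# so it only synchronises processes that share the same spatial point(s) and can safely
# be called inside a spatially-parallelised loop.
vspace_block_synchronize() = MPI.Barrier(comm_vspace)

In the real code the colour would of course come from the loop-range setup for the current region type, rather than from the rank arithmetic above.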

Optimize for memory usage

The way to optimize for minimal memory usage would be to not do shared-memory parallelism over space. Then we'd only need (vpa.n,vperp.n)-sized buffers, and it would be fine to call _block_synchronize() inside loops over s, r, z. This approach doesn't need any modifications to the looping/parallelisation code.

The example could look like

begin_vpa_vperp_region()
@loop_s_r_z is ir iz begin
    @loop_vperp_vpa ivperp ivpa begin
        # do something parallelisable,
        # e.g., set sources for an elliptic solve
        S[ivpa,ivperp,iz,ir,is] = ...
    end

    # recast S into an array with a single, flattened dimension for velocity space,
    # indexed by the compound index c, i.e. S1D[ic,iz,ir,is];
    # this can be done by creating a reshaped view, to save memory
    S1D = reshape(S, vperp.n*vpa.n, z.n, r.n, nspecies)

    # synchronise with the loop above and do the matrix solve
    # in the compound index c
    _block_synchronize()
    field = lu_obj \ S1D[:,iz,ir,is]

    # get back to ivpa, ivperp
    _block_synchronize()
    @loop_vperp_vpa ivperp ivpa begin
        result[ivpa,ivperp,iz,ir,is] = ...
    end
end

Todo?

If the 'optimized approach' is the way to go, I can start having a look at the looping code to add a communicator over vspace subblocks and the vspace_block_synchronize() function. I could try to do that as a PR into your branch with the collision operator?

@mrhardman
Collaborator Author

Comment 1:

# this can actually be done by creating a reshaped view, to save memory
S1D = reshape(S, vperp.n*vpa.n, z.n, r.n, nspecies)

Does the reshaped view index the array in the same way that I have chosen to index it when creating the assembled matrix problems? This detail is the reason why I have done everything manually: I don't want to assume that my code and the Julia convention will remain consistent. For people reading the code, having the base-level functions explicitly available seems nicer for understanding what is happening. If we can use one of your MPI improvements, there is no reason for the dummy arrays to have the size of the pdf, so saving memory here shouldn't be important.

Comment 2:

I like the ideas in both the "Optimized approach" and "Optimize for memory usage". I would rather have the former, but I don't want to cause software development that isn't easy to execute for speculative reasons. Since C[F,F] is very slow to evaluate, we may not be able to use the operator with shared memory alone anyway, in which case these and more substantial changes will be needed. My suggestion would be that I try the latter option with the existing code framework, and then comment on the execution time for a simulation. If, following that experiment, we think that the optimised code is a must, we can look into it as and when.

If however you want to think about the optimized approach anyway now that I have raised the idea, I am happy for you to PR into the branch https://github.com/mabarnes/moment_kinetics/tree/radial-vperp-standard-DKE-Julia-1.7.2-mpi.

@mrhardman
Collaborator Author

A further thought: in calculating the boundary data, I also used the begin_vpa_region() and begin_vperp_region() commands, in addition to the begin_vperp_vpa_region() used in your recommendation above. Presuming that I cannot switch between regions within the @loop_s_r_z macro, what are the consequences of using the "wrong" region for a loop command?

For example, if I have to use

begin_vperp_vpa_region()
@loop_s_r_z is ir iz begin
    # ... stuff
    @loop_vpa ivpa begin
        # stuff that is embarrassingly parallel in vpa
    end
    # ... stuff
    @loop_vperp ivperp begin
        # stuff that is embarrassingly parallel in vperp
    end
end

what are the consequences for the number of cores used in the loops?

@johnomotani
Collaborator

It's not safe to loop over only vpa or only vperp within a begin_vperp_vpa_region() block. Two processes that own the same vpa indices but different vperp indices would try to write to the same locations in an array, which would be an error (one of the annoying 'undefined behaviour with effects that change from run to run' errors).

It should be fine to call begin_vperp_region() and begin_vpa_region() within an @loop_s_r_z, though. As the species, r and z dimensions aren't parallelised, all the processes in a shared-memory block will always call those functions at the same time, as long as the region type does not include any parallelised species or spatial dimensions when entering the @loop_s_r_z (i.e. call e.g. begin_vpa_region() before the loop).

The reason it's generally not allowed to call begin_*_region() within a parallelised loop is that if, say, the r dimension is parallelised and you are within an @loop_s_r_*, different processes may 'own' slightly different numbers of r-points, so a begin_vpa_region() inside the loop would be called a different number of times on different processes. That would lead to the code hanging, as some processes in the shared-memory block communicator call MPI.Barrier() and others don't.
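
To make that failure mode concrete, here is a tiny self-contained MPI.jl sketch (nothing moment_kinetics-specific; the uneven n_local split is invented purely to show the hang):

using MPI

MPI.Init()
comm = MPI.COMM_WORLD
rank = MPI.Comm_rank(comm)

# Pretend each process 'owns' a different number of points in the parallelised dimension,
# e.g. 3 points on rank 0 and 2 points on every other rank.
n_local = rank == 0 ? 3 : 2

for i in 1:n_local
    # ... work on local point i ...
    # A collective call inside the unevenly-split loop: rank 0 reaches this barrier
    # 3 times, the others only twice, so the final call never completes and the
    # program hangs. This is why begin_*_region() (which synchronises the whole
    # shared-memory block) must not be called inside such a loop.
    MPI.Barrier(comm)
end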

@johnomotani
Collaborator

Does the reshaped view index the array in the same way that I have chosen to index it when creating the assembled matrix problems?

Yes, I think so. reshape() leaves the underlying array untouched and the vpa dimension is the fastest-varying dimension in the original array, so incrementing the 'flattened' index by 1 would increment ivpa by 1, and incrementing the 'flattened' index by vpa.n would increment ivperp by 1.
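
A quick way to check that column-major mapping in plain Julia (small, arbitrary sizes; independent of the moment_kinetics code):

nvpa, nvperp = 4, 3                 # small, arbitrary sizes for the check
S = rand(nvpa, nvperp)              # stands in for S[ivpa,ivperp] at a single spatial point
S1D = reshape(S, nvpa*nvperp)       # same data viewed with a single flattened index c

for ivperp in 1:nvperp, ivpa in 1:nvpa
    ic = ivpa + (ivperp - 1)*nvpa   # +1 in c steps ivpa by 1; +nvpa in c steps ivperp by 1
    @assert S1D[ic] == S[ivpa, ivperp]
end
println("flattened index c matches the (ivpa, ivperp) ordering")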

@mrhardman
Collaborator Author

My attempt to implement the weak-form Fokker-Planck operator into the time evolving code is in this commit: 3f6e4db.

@johnomotani, comments on whether or not I correctly followed your recommendations would be helpful! I tried to keep the s, r and z coordinates local, and left the parallelism over vpa, vperp as was chosen for the original test script.

@mrhardman
Collaborator Author

mrhardman commented Nov 1, 2023

Update for future reference: with help from @johnomotani, a shared-memory parallelised version of the C[F,F] collision operator exists on this commit c0cc4a2.

The parallelisation is over vpa, vperp as in the original test script, with serial looping in z, r, s. Without the optimisation described above (adding to the shared-memory formalism the ability to synchronise arrays with the same iz, ir, is affiliation within an @loop_s_r_z), it may be preferable for production runs with a range of positions and species to change the parallelisation to be over z, r, keeping s, vpa, vperp local. This is because the collision operator does not couple spatial positions (although cross-species collisions can couple species).

EDIT: This observation can be confirmed in practice by running a 1D2V simulation using distributed-memory MPI for the z dimension, and comparing it to the same problem run with only shared-memory MPI (and the same number of cores). The distributed-memory MPI simulation is faster by an order of magnitude, suggesting that it is better to parallelise over space than over velocities. More careful profiling will be required before picking an optimised MPI scheme.
