I have read the README and ref/README, including the Troubleshooting Guide.
I have reviewed existing issues to ensure this has not already been reported.
Is your feature request related to a problem? Please describe.
The existing code supports parallelism only via OpenMP threading. This limits the utility of the kernel for evaluating the performance tradeoffs of different implementations. A proper evaluation must include analysis of multi-node as well as single-node parallel performance.
Describe the solution you'd like
A full-featured MPI capability is needed. This will include:
Ability to run with any number of MPI tasks between 1 and N
Ability to run with MPI+OpenMP threads such that each MPI rank spawns a configurable number of OpenMP threads
A customizable processor grid for distribution of MPI ranks (e.g. 4x4, 2x8, 1x16)
A default processor grid that maximizes the "square-ness" of the grid to minimize communication (see the sketch after this list)
Automatic domain decomposition of data onto the chosen processor grid
All options configurable at runtime in the input namelist
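As a rough illustration of the default grid selection, here is a minimal C sketch that uses MPI_Dims_create to pick an as-square-as-possible factorization of the rank count. The hard-coded override values stand in for whatever the runtime namelist would supply and are not part of the existing kernel:

```c
/* Sketch only: choosing a default "as square as possible" processor grid.
 * Assumes a 2D domain; the override values stand in for whatever the
 * runtime input (e.g. a namelist) would provide. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int nranks, rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* dims = {0,0} lets MPI_Dims_create pick the most "square" factorization
     * of nranks; nonzero entries (e.g. user overrides) are honored as-is. */
    int dims[2] = {0, 0};          /* e.g. {2, 8} to force a 2x8 grid */
    MPI_Dims_create(nranks, 2, dims);

    if (rank == 0)
        printf("Using a %d x %d processor grid for %d ranks\n",
               dims[0], dims[1], nranks);

    MPI_Finalize();
    return 0;
}
```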
Describe alternatives you've considered
MPI is a ubiquitous and standard means of parallelizing across nodes of a supercomputer. While there may be other ways of achieving that, having MPI parallelism is necessary for establishing a baseline of performance against which other implementations should be compared.
Adding MPI will be a large change. If possible, it would be easier to evaluate and make progress if it could be broken down into smaller pieces. It isn't clear to me yet what those pieces should be. We can discuss here. To start, I'm going to throw out an initial breakdown for us to discuss and refine.
Implement halos for the existing code. This would simply extend the dimensions of existing arrays without adding any parallelism. The code would not run in parallel, and would loop over the same indices as it does in serial, but the arrays would be properly dimensioned for halos.
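For illustration only, here is a C sketch of this step under assumed array names and sizes (the kernel's actual arrays and halo width may differ): the allocation grows by the halo width on each side, while the loops still cover exactly the serial interior indices.

```c
/* Sketch: dimensioning an array for halos without changing the computation.
 * NX, NY, and HALO are placeholder values, not taken from the kernel. */
#include <stdlib.h>

#define NX   64
#define NY   64
#define HALO 1

int main(void) {
    /* Allocate with room for halo cells on every side. */
    int nxh = NX + 2 * HALO, nyh = NY + 2 * HALO;
    double *field = calloc((size_t)nxh * nyh, sizeof *field);

    /* Loops still run over the interior only, exactly as in the serial code;
     * the halo cells exist but are not yet filled or exchanged. */
    for (int j = HALO; j < NY + HALO; j++)
        for (int i = HALO; i < NX + HALO; i++)
            field[j * nxh + i] = 1.0;

    free(field);
    return 0;
}
```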
Implement calculation of domain decomposition for a default processor grid (as square as possible) for an arbitrary number of MPI ranks. This would be a routine that computes the local indices and allocates arrays for each tile. Allocations and loop indices would be adjusted as needed. Still no parallelism, but tests would be added to verify that the decomposition works.
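A possible shape for that routine, sketched in C with a placeholder block-with-remainder rule (the decomposition the kernel ultimately adopts may differ): each coordinate on the processor grid maps to a start index and a point count that differ by at most one across tiles.

```c
/* Sketch: computing one tile's global index range along one dimension of a
 * processor grid. The distribution rule here is an assumption, not
 * necessarily what the kernel would use. */
#include <stdio.h>

/* Give the first (n % p) tiles one extra point so sizes differ by at most 1. */
static void local_range(int n, int p, int coord, int *start, int *count) {
    int base = n / p, rem = n % p;
    *count = base + (coord < rem ? 1 : 0);
    *start = coord * base + (coord < rem ? coord : rem);
}

int main(void) {
    int nx = 100, px = 3;                 /* placeholder global size and grid */
    for (int c = 0; c < px; c++) {
        int s, cnt;
        local_range(nx, px, c, &s, &cnt);
        printf("tile %d: global i = [%d, %d), %d points\n", c, s, s + cnt, cnt);
    }
    return 0;
}
```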
Add MPI_Init() and implement a halo exchange. This would be a routine for exchanging the data in the halo. Add a test to verify that the halo exchange works for multiple numbers of MPI ranks. Add tests to show that running with different numbers of MPI ranks produces the same results.
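One way the exchange could look, sketched in C for the x direction only. The Cartesian communicator, halo width of 1, non-periodic boundaries, and point-by-point Sendrecv are all simplifying assumptions for illustration; a real implementation would also exchange in y and use derived datatypes or packed buffers for non-contiguous faces.

```c
/* Sketch: a 1D halo exchange along the x direction of a Cartesian grid.
 * Array shapes, the halo width, and non-periodic boundaries are assumptions. */
#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int nranks;
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    /* Build a Cartesian communicator from the default (near-square) grid. */
    int dims[2] = {0, 0}, periods[2] = {0, 0};
    MPI_Dims_create(nranks, 2, dims);
    MPI_Comm cart;
    MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, 1, &cart);

    /* Placeholder local tile: ny rows of nx interior points plus halos. */
    int nx = 16, ny = 16, halo = 1, nxh = nx + 2 * halo;
    double *field = calloc((size_t)nxh * (ny + 2 * halo), sizeof *field);

    /* Neighbors along dimension 0 (MPI_PROC_NULL at physical boundaries,
     * which makes the Sendrecv calls below no-ops there). */
    int left, right;
    MPI_Cart_shift(cart, 0, 1, &left, &right);

    /* Exchange the edge point of each interior row; one point per call is
     * inefficient but keeps the sketch short. */
    for (int j = halo; j < ny + halo; j++) {
        double *row = field + (size_t)j * nxh;
        /* send rightmost interior point right, receive left halo from left */
        MPI_Sendrecv(&row[nx],        1, MPI_DOUBLE, right, 0,
                     &row[0],         1, MPI_DOUBLE, left,  0,
                     cart, MPI_STATUS_IGNORE);
        /* send leftmost interior point left, receive right halo from right */
        MPI_Sendrecv(&row[halo],      1, MPI_DOUBLE, left,  1,
                     &row[nx + halo], 1, MPI_DOUBLE, right, 1,
                     cart, MPI_STATUS_IGNORE);
    }

    free(field);
    MPI_Finalize();
    return 0;
}
```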
This is just a starting point. Please comment/suggest adjustments as needed. Breaking parallelization down into smaller pieces may prove quite difficult. However, the smaller we can make the steps toward the final goal, the easier it will be to implement, review, and merge each step along the way.
A decision was made not to pursue a full MPI parallelization. Instead, a simulation of parallel execution (see #35) is provided by running N copies of the serial kernel under MPI, including simulating the work of a halo exchange operation across those copies. Since no further work on MPI will be pursued, this issue is being closed.