Add CUDA/HIP implementations of reduction operators #12569
base: main

Conversation
}

if (MCA_ACCELERATOR_NO_DEVICE_ID == target_device) {
    opal_accelerator.mem_release_stream(device, target, stream);
Just as a thought for a subsequent PR: we could get rid of the mem_alloc and mem_release functions in the accelerator framework interfaces and keep only the stream-based versions, with the default stream being used if the user provided no stream argument. This would trim the API a bit and avoid nearly identical code.
ompi/mca/op/rocm/Makefile.am (outdated)

sources = op_rocm_component.c op_rocm.h op_rocm_functions.c op_rocm_impl.h
rocm_sources = op_rocm_impl.hip

HIPCC = hipcc
We might have to change that in the near future: hipcc is going away, and we should be using amdclang with --offload-arch arguments. It's OK to leave it as is for now.
#define xstr(x) #x
#define str(x) xstr(x)

#define CHECK(fn, args) \
We don't abort inside the software stack.
@@ -152,22 +152,50 @@ int ompi_op_base_op_select(ompi_op_t *op)
}

/* Copy over the non-NULL pointers */
for (i = 0; i < OMPI_OP_BASE_TYPE_MAX; ++i) {
Too many unnecessary retain/release. We can instead manipulate the avail->ao_module when needed, starting from the highest to the lowest priority and calling the corresponding op until one returns success.
ompi/mca/op/cuda/Makefile.am (outdated)

#sources_extended = op_cuda_functions.cu
cu_sources = op_cuda_impl.cu

NVCC = nvcc -g
Not good: we need to discover the CUDA compiler via the official mechanisms instead of hardcoding it here. At best you should use the one discovered by the m4 script, nvcc_bin, but even that is a stretch, as we will not know what flags to pass.
It's easy to replace $nvcc_bin, but I will need some help with integrating configure flags for the nvcc flags.
# -o $($@.o:.lo)

# Open MPI components can be compiled two ways:
This is especially not true for this component, it can only be built dynamically.
The operator support should only be built dynamically? @edgargabriel suggested that it should be made dynamic by default, but should we disallow static building entirely?
If I understand correctly, allowing static builds forces libompi.so to have a dependency on CUDA. This will break the build on non-CUDA machines.
The accelerator components are dynamic-by-default (#12055) but I couldn't find a similar mechanism for OMPI. We should still allow building the ops statically, for those who know what they are doing.
As soon as a component calls into libcuda (or, more precisely in this case, libcudart), it can never be built statically.
I'm not sure why that is. The OMPI library would have to be linked against libcudart, but that's possible if you build for a CUDA environment specifically.

I marked the two op modules as dso-by-default now.
Just for the sake of it, I built ompi/main with CUDA from scratch, and now the dependency on libcudart exists everywhere, including ompi_info.
Yes, this is broken on main; this branch doesn't change that.
ompi/mca/op/cuda/op_cuda_impl.cu (outdated)

const int stride = blockDim.x * gridDim.x;   \
for (int i = index; i < n/vlen; i += stride) { \
    vtype vin = ((vtype*)in)[i];             \
    vtype vinout = ((vtype*)inout)[i];       \
Why don't you use the templated op defined earlier in the file? Or, if you don't need it, you should remove it.
I am reworking the vectorization to make it more flexible and to avoid some of the contortions needed to map the fixed-size integers onto vectors of variable-size integers.
I reworked the vectorization with a custom type and some template work. The goal now is to consistently have 128-bit loads and stores.
/** Function pointers for all the different datatypes to be used
    with the MPI_Op that this module is used with */
ompi_op_base_handler_fn_1_0_0_t opm_fns[OMPI_OP_BASE_TYPE_MAX];
ompi_op_base_3buff_handler_fn_1_0_0_t opm_3buff_fns[OMPI_OP_BASE_TYPE_MAX];
union {
Overly complicated, but I can't think of anything significantly better right now.
#pragma GCC diagnostic push
#pragma GCC diagnostic ignored "-Wgnu-zero-variadic-macro-arguments"

static inline void device_op_pre(const void *orig_source1,
@bosilca @edgargabriel If device_op_pre and device_op_post use the accelerator framework, they are pretty much independent of the model (minus the last two lines). I wonder whether they should be moved to a header in base/ and shared between the two implementations. The last two lines can be taken out and put into the op macro from where they are called.
Let me add a generic comment here, mostly as a reminder to self: I don't think this is how we should use these op modules, especially not with accelerators. In my vision we decide once and for all, for each operation (or collective), which MPI_Op we will use, and we stay with it for the entire duration. First, because there is no reason to execute half of the MPI_Op on the host and the other half on the device; it is all or none. Second, because we definitely don't want to start each kernel independently; the overhead would be just too costly, annihilating most of the benefits.

Instead, once we start a collective, we would start a "service" bound to a specific context (GPU or CPU), and this service would remain active for as long as we are in a collective that needs the GPU op, removing all costs related to kernel submission. The GPU threads would poll a well-defined memory location for work updates, and the CPU would post new ops into this queue. The only drawback I can see is that the service takes resources from the application, but this loss is very small, as a single SM (or two) is more than enough to saturate the network bandwidth. Once we are outside collectives requiring the GPU op, we can release these resources back to the application.
The operators are generated from macros. Function pointers to kernel launch functions are stored inside the ompi_op_t as a pointer to a struct that is filled if accelerator support is available. The ompi_op* API is extended to include versions taking streams and device IDs, to allow enqueuing operators on streams. The old functions map to the stream versions with a NULL stream.

Signed-off-by: Joseph Schuchart <joseph.schuchart@stonybrook.edu>
CUDA provides only limited vector widths, and only for variable-width integer types. We use our own vector type and some C++ templates to get more flexible vectors. We aim for 128-bit loads by adjusting the width based on the type size.

Signed-off-by: Joseph Schuchart <joseph.schuchart@stonybrook.edu>
The CUDA test fails because we detect CUDA but NVCC is not available (at least in …)
config/opal_check_nvcc.m4 (outdated)

# no path specified, try to find nvcc
[AC_PATH_PROG([OPAL_NVCC], [nvcc], [])])

# If the user requested to disable sphinx, then pretend we didn't
nvcc, not sphinx, I assume.
I updated the PR to have precious variables …
This is the second part of #12318, which provides the device-side reduction operators and adds stream semantics to ompi_op_reduce. As usual, the operators are generated from macros. Function pointers to kernel launch functions are stored inside the ompi_op_t as a pointer to a struct that is filled if accelerator support is available.
There are two pieces to the cuda/hip implementation:
Currently not supported are short float and long double, since they are either not supported everywhere or not standardized. I hope I caught all other types, including pair types for the loc functions. Since the implementations are agnostic of OMPI/OPAL headers, the code has to map the Fortran types to C types in the implementation.

The device_op_pre and device_op_post functions are there to set up the environment for the kernel, including allocating memory on the device if one of the inputs is not on the chosen device. Operators cannot return an error, so whatever the caller feeds us we have to eat. Not pretty, but hopefully better than aborting.

This branch requires #12356. I will rebase once that is merged.
Questions: