-
Notifications
You must be signed in to change notification settings - Fork 407
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Adding ExecutionSpace partitioning function #4096
Conversation
core/src/Kokkos_Core.hpp
Outdated
// Customization point for backends | ||
// Default behavior is to return the passed in instance | ||
template <class ExecSpace, class... Args> | ||
std::array<ExecSpace, sizeof...(Args)> partition_space(ExecSpace space, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are we intentionally requiring people to partition into a compile-time sized set of args? Should we have a variant where this takes a vector of weights and returns a vector of instances?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hm interesting question, we could have both, that said I didn't see a real need for that much runtime choice.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The partition count in EMPIRE will be runtime-determined
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That said, it could just call the current unweighted CUDA implementation that just makes separate streams one by one, but that's probably not the usage you were going for.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done, using std::vector now
void check_equalness(ExecSpace, ExecSpace) {} | ||
|
||
#ifdef KOKKOS_ENABLE_CUDA | ||
void check_equalness(Kokkos::Cuda exec1, Kokkos::Cuda exec2) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps check_distinctive
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
sum1); | ||
Kokkos::parallel_reduce( | ||
Kokkos::RangePolicy<TEST_EXECSPACE>(instances[1], 0, N), SumFunctor(), | ||
sum2); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This poses the same testing challenge as I noted in #4059 - confirming that various kernels actually ran on distinct execution spaces, and that they were at least hypothetically concurrent.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the distinctive part is probably good enough for now
|
This allows the creation of multiple instances, only CUDA implements an actual implementation, everyone else returns the same thing multiple times for now.
I addressed all the points: for the int/float thing I allow mix, I am using a fold expression to static_assert that they all are, that fold expresssion is protected by feature test macro. |
@PhilMiller @DavidPoliakoff I think i addressed everything |
For runtime-determined partition count, I think the partition functions need to take |
@PhilMiller yeah, I believe we discussed this in the meeting. Should either be a vector, or a template<template typename> thingy, and I recommend vector |
core/src/Kokkos_Cuda.hpp
Outdated
template <class... Args> | ||
std::vector<Cuda> partition_space(Cuda space, Args...) { | ||
std::vector<Cuda> instances(sizeof...(Args)); | ||
#ifdef __cpp_fold_expressions | ||
static_assert( | ||
(... && std::is_arithmetic_v<Args>), | ||
"Kokkos Error: partitioning arguments must be integers or floats"); | ||
#endif | ||
for (int s = 0; s < int(sizeof...(Args)); s++) { | ||
cudaStream_t stream; | ||
CUDA_SAFE_CALL(cudaStreamCreate(&stream)); | ||
instances[s] = Cuda(stream, true); | ||
} | ||
return instances; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/var/jenkins/workspace/Kokkos/core/src/Kokkos_Cuda.hpp:260:40: error: unused parameter 'space' [clang-diagnostic-unused-parameter]
std::vector<Cuda> partition_space(Cuda space, Args...) {
^
/var/jenkins/workspace/Kokkos/core/src/Kokkos_Cuda.hpp:269:5: error: use of undeclared identifier 'CUDA_SAFE_CALL' [clang-diagnostic-error]
CUDA_SAFE_CALL(cudaStreamCreate(&stream));
^
62c6a23
to
62ef2f3
Compare
Also moved the CUda overloads to the Instances header file.
I think the variadic version only needs to be implemented once, in the generic code, since it can then call the version taking a |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will suite EMPIRE's needs well enough. The interpretation of the weights will become an interesting question down the line, but it'll be good to get this step integrated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm very lightly opposed to having a vector and a variadic overload. If anybody else wants to back this up, I'll change it to a request changes. But I think this is a good implementation of what we're aiming at
} | ||
|
||
template <class T> | ||
std::vector<HIP> partition_space(const HIP &, std::vector<T> &weights) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why are weights
by non-const reference?
}; | ||
|
||
template <class ExecSpace> | ||
void check_distinctive(ExecSpace, ExecSpace) {} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You mean distinct from one another?
|
||
TEST(TEST_CATEGORY, partitioning_by_args) { | ||
auto instances = | ||
Kokkos::Experimental::partition_space(TEST_EXECSPACE(), 1, 1.); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did we capture anywhere the rational for being able to mix types?
This allows the creation of multiple instances, but only CUDA implements an actual implementation, everyone else returns the same thing multiple times for now.
The design is as a customization point, which would allow for ADL hopefully.