-
Notifications
You must be signed in to change notification settings - Fork 407
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Use memset in deep_copy #3944
Use memset in deep_copy #3944
Conversation
Some more results on V100
|
SYCL V100 before
after
|
For HIP I am seeing (old)
and (new)
|
Please use the benchmark comparison tool when you post results |
4cd659a
to
948958c
Compare
9ff4137
to
5e7766b
Compare
CUDA:
|
SYCL:
|
HIP
|
I think this is now good enough to be looked at and being discussed (in particular if the results are good enough). |
I couldn't find any |
The testing script was #include <Kokkos_Core.hpp>
#include <benchmark/benchmark.h>
struct TestZeroing
{
template <class ExecutionSpace, class View>
TestZeroing(ExecutionSpace const &s, View x) {
run(s, x);
}
template <class ExecutionSpace, class View>
void run(ExecutionSpace const &, View x) {
Kokkos::deep_copy(x, 0);
}
};
struct TestZeroingAsync
{
template <class ExecutionSpace, class View>
TestZeroingAsync(ExecutionSpace const &s, View x) {
run(s, x);
}
template <class ExecutionSpace, class View>
void run(ExecutionSpace const &s, View x) {
Kokkos::deep_copy(s, x, 0);
s.fence();
}
};
struct TestOne
{
template <class ExecutionSpace, class View>
TestOne(ExecutionSpace const &s, View x) {
run(s, x);
}
template <class ExecutionSpace, class View>
void run(ExecutionSpace const &, View x) {
Kokkos::deep_copy(x, 1);
}
};
struct TestOneAsync
{
template <class ExecutionSpace, class View>
TestOneAsync(ExecutionSpace const &s, View x) {
run(s, x);
}
template <class ExecutionSpace, class View>
void run(ExecutionSpace const &s, View x) {
Kokkos::deep_copy(s, x, 1);
s.fence();
}
};
template <class K, typename T>
void BM_generic(benchmark::State &state) {
#if defined(KOKKOS_ENABLE_HIP)
using ExecutionSpace = Kokkos::Experimental::HIP;
#elif defined(KOKKOS_ENABLE_CUDA)
using ExecutionSpace = Kokkos::Cuda;
#elif defined(KOKKOS_ENABLE_OPENMPTARGET)
using ExecutionSpace = Kokkos::Experimental::OpenMPTarget;
#else
using ExecutionSpace = Kokkos::Experimental::SYCL;
#endif
int n = state.range(0);
ExecutionSpace space{};
Kokkos::View<T *, ExecutionSpace> x("x", n);
K(space, x); // warm-up
for (auto _ : state) {
K(space, x);
}
state.counters["Bandwidth"] =
benchmark::Counter(sizeof(T) * n,
benchmark::Counter::kIsIterationInvariantRate);
}
#define REGISTER_BENCHMARK(KERNEL, TYPE) \
BENCHMARK_TEMPLATE(BM_generic, KERNEL, TYPE) \
->RangeMultiplier(8) \
->Range(1024, 8 << 24) \
->UseRealTime();
REGISTER_BENCHMARK(TestZeroing, char);
REGISTER_BENCHMARK(TestZeroing, int);
REGISTER_BENCHMARK(TestZeroing, double);
REGISTER_BENCHMARK(TestZeroing, std::complex<double>);
REGISTER_BENCHMARK(TestZeroingAsync, char);
REGISTER_BENCHMARK(TestZeroingAsync, int);
REGISTER_BENCHMARK(TestZeroingAsync, double);
REGISTER_BENCHMARK(TestZeroingAsync, std::complex<double>);
REGISTER_BENCHMARK(TestOne, char);
REGISTER_BENCHMARK(TestOne, int);
REGISTER_BENCHMARK(TestOne, double);
REGISTER_BENCHMARK(TestOne, std::complex<double>);
REGISTER_BENCHMARK(TestOneAsync, char);
REGISTER_BENCHMARK(TestOneAsync, int);
REGISTER_BENCHMARK(TestOneAsync, double);
REGISTER_BENCHMARK(TestOneAsync, std::complex<double>);
int main(int argc, char **argv) {
Kokkos::initialize(argc, argv);
benchmark::Initialize(&argc, argv);
benchmark::RunSpecifiedBenchmarks();
Kokkos::finalize();
} |
Hm Looks like for HIP right now its not a win generally. |
Still probably good enough |
core/src/Kokkos_CopyViews.hpp
Outdated
} | ||
|
||
template <typename ExecutionSpace, class DT, class... DP> | ||
inline void plain_memcpy( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"Plain memcpy" does not describe what this function does.
inline void plain_memcpy( | |
void copy_assign_value( |
(must find a more appropriate name)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The point I tried making here is that we know that the buffer to be filled is contiguous in memory. What about contiguous_fill
or similar?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
... or flat_fill
since we are calling ViewFill<ViewTypeFlat,...
in the end.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
contiguous_fill sounds good to me.
core/src/Kokkos_CopyViews.hpp
Outdated
} | ||
|
||
template <class DT, class... DP> | ||
inline void memset(const View<DT, DP...>& dst, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
inline void memset(const View<DT, DP...>& dst, | |
void fill(const View<DT, DP...>& dst, |
(must fix)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the choice here depends a little bit on the one for plain_memcpy
. This function does essentially the same (and calls plain_memcpy
but has a special case for the zero_byte_case).
I don't like fill
since it doesn't say that the underlying memory must be contiguous. memset
would say that but is too strong.
What about flat_fill_or_memset
here?
Results for
are pretty impressive. |
9f4f71b
to
7a70acc
Compare
7a70acc
to
2fb4409
Compare
@dalg24 I addressed your commits apart from the naming ones where I'd like to discuss some more and also used |
bce5577
to
b2f0413
Compare
I also looked into doing the
but it's not quite clear if that's better (or what a good threshold would be) than a single
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think Damiens comment still stands that plain_memcpy is not a good one probably.
Yeah thats fine. |
I replaced |
Cleanup history or want me to squash and merge? |
Just squash and merge. I don't think there is anything worth keeping in the history. |
Fixes #3930. As noted in #3930 (comment) this seems to be advantageous at least for
Cuda
but we should benchmark this a little better and also possibly extend to other backends.