When using a storage order of {3, 2, 0, 1}, if dimension 3 (zero-indexed, i.e. the last dimension) has an extent of 1, it's much slower to copy the buffer to the GPU than if that dimension had an extent greater than 1. For example, a 1024 x 1024 x 3 x 1 buffer takes over 1000x longer to copy to the GPU than a 1024 x 1024 x 3 x 2 buffer, even though it holds half the data.
On a Quadro T2000 and CUDA 12.6:
1024 x 1024 x 3 x 2 (ordered {3, 2, 0, 1}): 4 ms
1024 x 1024 x 3 x 1 (ordered {3, 2, 0, 1}): 5262 ms
1024 x 1024 x 3 x 2 (ordered {0, 1, 2, 3}): 4 ms
1024 x 1024 x 3 x 1 (ordered {0, 1, 2, 3}): 2 ms
On an RTX A5000 and CUDA 12.8:
1024 x 1024 x 3 x 2 (ordered {3, 2, 0, 1}): 1 ms
1024 x 1024 x 3 x 1 (ordered {3, 2, 0, 1}): 4463 ms
1024 x 1024 x 3 x 2 (ordered {0, 1, 2, 3}): 1 ms
1024 x 1024 x 3 x 1 (ordered {0, 1, 2, 3}): 0 ms
I confirmed this with more detailed benchmarks as well, but here is a simple reproducer:
#include <chrono>
#include <cstdio>
#include <vector>
#include <Halide.h>

// Times one host-to-device copy of a 4-D float buffer with the given
// storage order and shape, and returns the elapsed time in milliseconds.
long test(std::vector<int> order, std::vector<int> shape) {
    auto target = Halide::get_host_target().with_feature(Halide::Target::CUDA);
    Halide::Buffer<float> buf(shape, order);

    // Fill the buffer with arbitrary data.
    for (int t = 0; t < shape[3]; ++t)
        for (int c = 0; c < shape[2]; ++c)
            for (int y = 0; y < shape[1]; ++y)
                for (int x = 0; x < shape[0]; ++x)
                    buf(x, y, c, t) = x * .5f + y * 2.f + c * 4.f + t * .8f;

    // Warm-up copy, so device allocation and CUDA initialization are not timed.
    buf.set_host_dirty();
    buf.device_malloc(target);
    buf.copy_to_device(target);
    buf.device_sync();

    // Timed copy.
    auto start = std::chrono::high_resolution_clock::now();
    buf.set_host_dirty();
    buf.copy_to_device(target);
    buf.device_sync();
    auto end = std::chrono::high_resolution_clock::now();

    // Cast to long so the value matches the %ld format on every platform.
    long ms = static_cast<long>(
        std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count());
    printf("%d x %d x %d x %d (ordered {%d, %d, %d, %d}): %ld ms\n",
           shape[0], shape[1], shape[2], shape[3],
           order[0], order[1], order[2], order[3],
           ms);
    return ms;
}

int main() {
    test({3, 2, 0, 1}, {1024, 1024, 3, 2});
    test({3, 2, 0, 1}, {1024, 1024, 3, 1});
    test({0, 1, 2, 3}, {1024, 1024, 3, 2});
    test({0, 1, 2, 3}, {1024, 1024, 3, 1});
    return 0;
}
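For reference, here is a rough sketch of the element strides each configuration should end up with. It assumes the storage order lists dimensions from innermost to outermost (as in Halide's interleaved {2, 0, 1} example) and that the allocation is dense. dense_strides is a hypothetical helper written for this illustration, not a Halide function, and it does not touch the GPU at all.

```cpp
#include <cassert>
#include <vector>

// Hypothetical helper (not part of Halide): computes the per-dimension
// element strides of a dense buffer, assuming `order` lists dimension
// indices from innermost (stride 1) to outermost.
std::vector<long> dense_strides(const std::vector<int> &extents,
                                const std::vector<int> &order) {
    assert(extents.size() == order.size());
    std::vector<long> strides(extents.size(), 0);
    long stride = 1;
    for (int dim : order) {
        strides[dim] = stride;   // this dimension steps by the running product
        stride *= extents[dim];  // the next dimension out skips over all of it
    }
    return strides;
}
```

Under those assumptions, the slow case (1024 x 1024 x 3 x 1, ordered {3, 2, 0, 1}) gets strides {3, 3072, 1, 1}, so dimensions 2 and 3 both have stride 1, while the fast 1024 x 1024 x 3 x 2 case gets {6, 6144, 2, 1}. I have not confirmed that this is what triggers the slow copy; it is just the most visible difference between the two layouts.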