`copy_to_device` is extremely slow on CUDA with storage order {3, 2, 0, 1} and 3rd extent = 1

With a storage order of {3, 2, 0, 1}, copying a buffer to the GPU is much slower when dimension 3 (zero-indexed, i.e. the last dimension) has an extent of 1 than when its extent is greater than 1. For example, a 1024 x 1024 x 3 x 1 buffer takes more than 1000x longer to copy to the GPU than a 1024 x 1024 x 3 x 2 buffer, even though it holds half the data.
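To make the two layouts concrete, here is a minimal sketch of the element strides each shape would have. It assumes the buffer is densely packed and that the storage order lists dimensions from innermost to outermost (as with `{2, 0, 1}` for interleaved RGB). `dense_strides` is a hypothetical helper written for illustration, not a Halide function, and the actual strides Halide allocates were not inspected here.

```cpp
#include <vector>

// Hypothetical helper (not part of Halide): element strides of a densely
// packed buffer, given per-dimension extents and a storage order that lists
// dimension indices from innermost to outermost. Assumes no padding.
std::vector<int> dense_strides(const std::vector<int> &shape,
                               const std::vector<int> &order) {
    std::vector<int> strides(shape.size(), 0);
    int stride = 1;
    for (int dim : order) {
        strides[dim] = stride;  // this dimension steps by the running stride
        stride *= shape[dim];   // the next dimension out skips over all of it
    }
    return strides;
}
```

Under these assumptions, `dense_strides({1024, 1024, 3, 2}, {3, 2, 0, 1})` gives `{6, 6144, 2, 1}`, while `dense_strides({1024, 1024, 3, 1}, {3, 2, 0, 1})` gives `{3, 3072, 1, 1}`, so in the slow case dimensions 2 and 3 end up with the same stride. Whether that is related to the slowdown is not established here.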

On a Quadro T2000 with CUDA 12.6:
```
1024 x 1024 x 3 x 2 (ordered {3, 2, 0, 1}): 4 ms
1024 x 1024 x 3 x 1 (ordered {3, 2, 0, 1}): 5262 ms
1024 x 1024 x 3 x 2 (ordered {0, 1, 2, 3}): 4 ms
1024 x 1024 x 3 x 1 (ordered {0, 1, 2, 3}): 2 ms
```
On an RTX A5000 with CUDA 12.8:
```
1024 x 1024 x 3 x 2 (ordered {3, 2, 0, 1}): 1 ms
1024 x 1024 x 3 x 1 (ordered {3, 2, 0, 1}): 4463 ms
1024 x 1024 x 3 x 2 (ordered {0, 1, 2, 3}): 1 ms
1024 x 1024 x 3 x 1 (ordered {0, 1, 2, 3}): 0 ms
```

I also confirmed this with more detailed benchmarks, but here is a simple reproducer:
```cpp
#include <chrono>
#include <cstdio>
#include <vector>

#include <Halide.h>

long test(std::vector<int> order, std::vector<int> shape) {
    auto target = Halide::get_host_target().with_feature(Halide::Target::CUDA);

    Halide::Buffer<float> buf(shape, order);
    for (int t = 0; t < shape[3]; ++t)
        for (int c = 0; c < shape[2]; ++c)
            for (int y = 0; y < shape[1]; ++y)
                for (int x = 0; x < shape[0]; ++x)
                    buf(x, y, c, t) = x * .5f + y * 2.f + c * 4.f + t * .8f;

    buf.set_host_dirty();

    buf.device_malloc(target);
    buf.copy_to_device(target);
    buf.device_sync();

    auto start = std::chrono::high_resolution_clock::now();
    buf.set_host_dirty();
    buf.copy_to_device(target);
    buf.device_sync();
    auto end = std::chrono::high_resolution_clock::now();

    // Cast so the value matches the %ld format specifier on every platform.
    long ms = static_cast<long>(
        std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count());
    printf("%d x %d x %d x %d (ordered {%d, %d, %d, %d}): %ld ms\n",
           shape[0], shape[1], shape[2], shape[3],
           order[0], order[1], order[2], order[3],
           ms);
    return ms;
}

int main() {
    test({3, 2, 0, 1}, {1024, 1024, 3, 2});
    test({3, 2, 0, 1}, {1024, 1024, 3, 1});

    test({0, 1, 2, 3}, {1024, 1024, 3, 2});
    test({0, 1, 2, 3}, {1024, 1024, 3, 1});
    return 0;
}
```
