
[Question] Stream and the best practice of thrust STL usage. #71

Closed
ZhenshengLee opened this issue Aug 18, 2021 · 4 comments

Comments

@ZhenshengLee
Contributor

CUDA streams are about concurrency of kernel execution and memory transfers.

According to https://github.com/neka-nat/cupoch/blob/master/src/cupoch/geometry/pointcloud.h, each point cloud holds three vectors: points, normals, and colors. Streams could therefore be applied per vector of a point cloud.

But I found that not all functions follow this policy.

For example, PassThroughFilter uses only the default stream, while the downsample function uses three streams, one per vector.

see

std::shared_ptr<PointCloud> PointCloud::PassThroughFilter(int axis_no,

std::shared_ptr<PointCloud> PointCloud::UniformDownSample(
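For reference, the per-vector stream pattern might look roughly like this (a hypothetical, untested sketch, not cupoch's actual code; the function and parameter names are illustrative):

```cpp
// Hypothetical sketch: one thrust operation per attribute vector, each on
// its own stream. Note that thrust::copy itself still blocks the host
// until the operation finishes, even when given a stream policy.
#include <thrust/copy.h>
#include <thrust/device_vector.h>
#include <thrust/execution_policy.h>

void copy_attributes_on_streams(const thrust::device_vector<float3> &points,
                                const thrust::device_vector<float3> &normals,
                                const thrust::device_vector<float3> &colors,
                                thrust::device_vector<float3> &out_points,
                                thrust::device_vector<float3> &out_normals,
                                thrust::device_vector<float3> &out_colors) {
    cudaStream_t s1, s2, s3;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);
    cudaStreamCreate(&s3);
    // thrust::cuda::par.on(stream) runs each algorithm on the given stream.
    thrust::copy(thrust::cuda::par.on(s1), points.begin(), points.end(),
                 out_points.begin());
    thrust::copy(thrust::cuda::par.on(s2), normals.begin(), normals.end(),
                 out_normals.begin());
    thrust::copy(thrust::cuda::par.on(s3), colors.begin(), colors.end(),
                 out_colors.begin());
    cudaStreamDestroy(s1);
    cudaStreamDestroy(s2);
    cudaStreamDestroy(s3);
}
```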

@neka-nat Could you explain why you chose to do it this way?

Or are there any drawbacks to using streams in CUDA?

Thanks.

@neka-nat
Owner

Thanks!
The use of three streams in UniformDownSample is probably ineffective.
Speeding up with streams requires asynchronous processing, but the thrust copy function is synchronous.
If you switch to the default stream, the performance will not change much.
NVIDIA/thrust#827 (comment)

Switching to async::copy may speed up the process.
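A minimal sketch of the difference (untested; assumes a Thrust version that ships `thrust/async/copy.h`):

```cpp
// thrust::copy blocks the host until the copy finishes; thrust::async::copy
// enqueues the copy and immediately returns an event, so two independent
// copies can overlap and the host synchronizes only when needed.
#include <thrust/async/copy.h>
#include <thrust/device_vector.h>

void overlapped_copies(const thrust::device_vector<float> &src1,
                       thrust::device_vector<float> &dst1,
                       const thrust::device_vector<float> &src2,
                       thrust::device_vector<float> &dst2) {
    auto e1 = thrust::async::copy(src1.begin(), src1.end(), dst1.begin());
    auto e2 = thrust::async::copy(src2.begin(), src2.end(), dst2.begin());
    e1.wait();  // synchronize only when the results are actually needed
    e2.wait();
}
```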

@ZhenshengLee
Contributor Author

> Thanks!
> The use of three streams in UniformDownSample is probably ineffective.
> Speeding up with streams requires asynchronous processing, but the thrust copy function is synchronous.
> If you switch to the default stream, the performance will not change much.
> NVIDIA/thrust#827 (comment)
>
> Switching to async::copy may speed up the process.

Thanks.

AFAIK, async::copy is not supported in thrust, but cudaMemcpyAsync with thrust::raw_pointer_cast would be a workaround according to NVIDIA/thrust#827 (comment), and this method is implemented in perception_cupoch with commit ZhenshengLee/perception_cupoch@8ba8b91
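The workaround might look like this (an untested sketch of the idea, not the exact code from that commit):

```cpp
// Bypass thrust::copy and issue an explicit asynchronous memcpy on a
// stream, using the raw device pointers behind the thrust vectors.
#include <thrust/device_vector.h>

void copy_on_stream(const thrust::device_vector<float> &src,
                    thrust::device_vector<float> &dst,
                    cudaStream_t stream) {
    cudaMemcpyAsync(thrust::raw_pointer_cast(dst.data()),
                    thrust::raw_pointer_cast(src.data()),
                    src.size() * sizeof(float),
                    cudaMemcpyDeviceToDevice, stream);
    // The caller synchronizes the stream when the result is needed,
    // e.g. cudaStreamSynchronize(stream);
}
```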

From NVIDIA/thrust#827 (comment), it seems that most thrust algorithms are blocking except for for_each; maybe std::async would help to simplify the code.
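For example, blocking thrust calls could be dispatched from separate host threads (an untested sketch of that idea; whether it actually overlaps depends on the streams and the driver):

```cpp
// Run two blocking thrust::sort calls from separate host threads via
// std::async, each on its own stream, so the host is not serialized.
#include <future>
#include <thrust/device_vector.h>
#include <thrust/execution_policy.h>
#include <thrust/sort.h>

void sort_two_vectors(thrust::device_vector<float> &a,
                      thrust::device_vector<float> &b,
                      cudaStream_t sa, cudaStream_t sb) {
    auto fa = std::async(std::launch::async, [&] {
        thrust::sort(thrust::cuda::par.on(sa), a.begin(), a.end());
    });
    auto fb = std::async(std::launch::async, [&] {
        thrust::sort(thrust::cuda::par.on(sb), b.begin(), b.end());
    });
    fa.wait();
    fb.wait();
}
```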

From NVIDIA/thrust#827 (comment), pinned memory would also help to get asynchronous copies, but pinned memory is not a good option for iGPU devices.
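The pinned-memory point, as an untested sketch: host-device copies can only overlap with other work when the host buffer is page-locked, which matters little on iGPUs (e.g. Jetson) where host and device share physical memory.

```cpp
// Allocate a pinned (page-locked) host buffer so cudaMemcpyAsync can be
// truly asynchronous with respect to the host; pageable memory would make
// the "async" copy fall back to a synchronous path.
#include <cstddef>

void pinned_download(const float *d_src, std::size_t n, cudaStream_t stream) {
    float *h_dst = nullptr;
    cudaMallocHost(&h_dst, n * sizeof(float));  // pinned allocation
    cudaMemcpyAsync(h_dst, d_src, n * sizeof(float),
                    cudaMemcpyDeviceToHost, stream);
    cudaStreamSynchronize(stream);
    // ... use h_dst ...
    cudaFreeHost(h_dst);
}
```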

@neka-nat
Owner

neka-nat commented Aug 19, 2021

I used async::copy to test whether async is enabled.
The implementation using async::copy is in the latest master.
The nvvp results show that async is working well.

Use thrust::copy:
(nvvp timeline screenshot: sync_copy)

Use thrust::async::copy:
(nvvp timeline screenshot: async_copy)

The calculation time was also faster using async::copy.

```python
import time
import cupoch as cph

if __name__ == "__main__":
    print("Load a ply point cloud, print it, and render it")
    pcd = cph.io.read_point_cloud("../../testdata/icp/cloud_bin_2.pcd")
    cph.visualization.draw_geometries([pcd])

    start = time.time()
    for _ in range(100):
        uni_down_pcd = pcd.uniform_down_sample(every_k_points=5)
    print(time.time() - start)
    cph.visualization.draw_geometries([uni_down_pcd])
    # sync: 0.025475502014160156
    # async: 0.018369436264038086
```

@ZhenshengLee
Contributor Author

I have checked https://github.com/NVIDIA/thrust/blob/main/CHANGELOG.md:
async::copy and the other async algorithms have been available since Thrust 1.9.4 (CUDA Toolkit 10.1).

This policy can easily be applied to all code that uses these algorithms:

thrust::async::reduce.
thrust::async::reduce_into, which takes a target location to store the reduction result into.
thrust::async::copy, including a two-policy overload that allows explicit cross system copies which execution policy properties can be attached to.
thrust::async::transform.
thrust::async::for_each.
thrust::async::stable_sort.
thrust::async::sort.
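Applying the policy would mostly mean swapping a blocking call for its async counterpart and keeping the returned event (an untested sketch, assuming Thrust >= 1.9.4):

```cpp
// Replace a blocking thrust::sort with thrust::async::sort: the call
// returns an event immediately, other independent work can be enqueued,
// and the host blocks only when the sorted data is actually required.
#include <thrust/async/sort.h>
#include <thrust/device_vector.h>

void async_sort_example(thrust::device_vector<int> &v) {
    auto ev = thrust::async::sort(v.begin(), v.end());
    // ... enqueue other independent work here ...
    ev.wait();
}
```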

Great! For JetPack 4.4.1, the thrust version is 1.9.7-1 (CUDA Toolkit 10.2 for Tegra).
