[BUG] performance in ros callback context. #62

ZhenshengLee · 2021-04-20T06:25:01Z

Following up on #60, I ran more performance tests.

The pass-through filter takes 0.096551 ms to reduce 119978 points to 5510.

That number comes from a simple standalone execution context; the code is in #60.

When the same function runs in a ROS callback context, it becomes much slower: 0.670853 ms.

Any ideas about the reason?

ZhenshengLee · 2021-04-20T06:25:44Z

The code is below.

// prj hdrs
#include "cupoch_conversions/cupoch_conversions.h"

// ros hdrs
#include <ros/ros.h>
#include <sensor_msgs/PointCloud2.h>

using namespace std;
using namespace cupoch;

std::string camera_point_topic;
auto cloud = std::make_shared<geometry::PointCloud>();
sensor_msgs::PointCloud2 m_pub_cupoch_pc;

void points_callback(const sensor_msgs::PointCloud2ConstPtr& msg)
{
    auto start = ros::WallTime::now();
    auto end = ros::WallTime::now();
    auto t1 = ros::WallTime::now();
    auto t2 = ros::WallTime::now();

    t1 = ros::WallTime::now();
    cupoch_conversions::rosToCupoch(msg, cloud);
    t2 = ros::WallTime::now();
    ROS_INFO_STREAM("detect_cupoch_thread rosToCupoch time: " << (t2 - t1).toSec() * 1000.0 << "[ms]");

    if (cloud->HasPoints())
    {
        ROS_INFO("m_detect_cupoch_thread - point size is %zu .", cloud->points_.size());
    }

    // Publish the filtered point cloud
    t1 = ros::WallTime::now();
    cloud = cloud->PassThroughFilter(0, -0.5, 0.5);
    t2 = ros::WallTime::now();
    ROS_INFO_STREAM("detect_cupoch_thread passXYZFilter time: " << (t2 - t1).toSec() * 1000.0 << "[ms]");
    ROS_INFO("m_detect_cupoch_thread passXYZFilter size is %zu .", cloud->points_.size());

    t1 = ros::WallTime::now();
    cupoch_conversions::cupochToRos(cloud, m_pub_cupoch_pc, "/gpuac/pointcloud_frame");
    m_pub_cupoch_pc.header.stamp = ros::Time::now();
    m_pub_cupoch_pc.header.frame_id = "/gpuac/pointcloud_frame";
    t2 = ros::WallTime::now();
    ROS_INFO_STREAM("detect_cupoch_thread cupochToRos time: " << (t2 - t1).toSec() * 1000.0 << "[ms]");

    end = ros::WallTime::now();
    ROS_INFO_STREAM("detect_cupoch_thread processing_time: " << (end - start).toSec() * 1000.0 << "[ms]");
}

int main(int argc, char** argv)
{
    ros::init(argc, argv, "cupoch_conversions_test_node");
    ros::NodeHandle private_nh("~");

    utility::InitializeAllocator(utility::PoolAllocation, 1000000000);
    utility::SetVerbosityLevel(utility::VerbosityLevel::Debug);
    cudaSetDeviceFlags(cudaDeviceScheduleBlockingSync);

    private_nh.param("camera3d_point_topic", camera_point_topic, std::string("/points_cloud"));

    auto points_sub = private_nh.subscribe(camera_point_topic, 10, points_callback);

    ros::spin();

    return 0;
}

neka-nat · 2021-05-05T07:22:24Z

Hi!
Why don't you try profiling it?
If you compare it to the #60 code, you may find something.
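Before profiling, it may also help to time both contexts the same way. A minimal sketch, assuming a hypothetical helper `mean_time_ms` (not part of cupoch or ROS) that discards a few warm-up calls and averages the rest, so one-time costs do not dominate a single sample:

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>

// Hypothetical helper (not part of cupoch or ROS): calls `fn` a few times
// untimed, then returns the mean wall-clock time per call in milliseconds.
template <typename Fn>
double mean_time_ms(Fn&& fn, std::size_t warmup_runs, std::size_t timed_runs)
{
    assert(timed_runs > 0);

    // Warm-up calls absorb one-time costs such as lazy initialization.
    for (std::size_t i = 0; i < warmup_runs; ++i) {
        fn();
    }

    const auto begin = std::chrono::steady_clock::now();
    for (std::size_t i = 0; i < timed_runs; ++i) {
        fn();
    }
    const auto end = std::chrono::steady_clock::now();

    const std::chrono::duration<double, std::milli> total = end - begin;
    return total.count() / static_cast<double>(timed_runs);
}
```

Here `fn` would wrap the filter call on a fresh copy of the input cloud. If the wrapped call only enqueues GPU work, a `cudaDeviceSynchronize()` inside `fn` is needed so that the clock sees the finished work.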

ZhenshengLee · 2021-08-18T13:47:15Z

@neka-nat I've found that the CPU frequency affects the memcpy time between device and host.

The CPU frequency governor should be set to performance mode:

sudo cpufreq-set -g performance
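To catch a wrong governor before benchmarking, a node could check it at startup. A minimal sketch, assuming the standard Linux sysfs path for `cpu0`; the helper names `read_governor` and `is_performance_governor` are hypothetical, and other cores have their own files that this does not inspect:

```cpp
#include <cassert>
#include <fstream>
#include <string>

// Hypothetical helper: returns the first line of a cpufreq governor file,
// or an empty string if the file cannot be read.
std::string read_governor(
    const std::string& path = "/sys/devices/system/cpu/cpu0/cpufreq/scaling_governor")
{
    assert(!path.empty());
    std::ifstream file(path);
    std::string governor;
    if (!file || !std::getline(file, governor)) {
        return std::string();
    }
    return governor;
}

// True only when the governor string was read and equals "performance".
bool is_performance_governor(const std::string& governor)
{
    return governor == "performance";
}
```

A node could log a warning when `is_performance_governor(read_governor())` is false, so slow timings are not mistaken for a library problem.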

I think this issue can be closed.

ZhenshengLee · 2021-09-17T12:54:42Z

Hi @neka-nat, I finally used nvvp to profile the voxel-downsample callback in a ROS context.

You can find the performance results for downsampling a point cloud of about 110,000 points in https://github.com/ZhenshengLee/ga_points_downsampler/blob/master/README.md

In fact, the downsampler does not outperform cuda_pcl (https://github.com/NVIDIA-AI-IOT/cuda-pcl),
even though rmm's memory pool is used, which I did not expect.

After applying the tuning commands below, I profiled cupoch's voxel downsample. The code is in this repo: https://github.com/ZhenshengLee/ga_points_downsampler

nvidia-settings -a '[gpu:0]/GPUPowerMizerMode=1'
sudo cpufreq-set -g performance

According to #71 (comment), sort_by_key has no async version.

Is there any way to improve the performance of this Thrust-based CUDA program?

Thanks.

Edit: I have uploaded the nvvp file in case you want to look at it.
pointdownsampler.nvvp.zip

It was created with nvvp 11.0.

neka-nat · 2021-09-27T01:34:59Z

Hi,
Thank you for reporting.
The current voxel filter algorithm is bottlenecked by sort_by_key, but there is a way to avoid it:
implementing the filter with a hash map (stdgpu).

What I am wondering is whether the algorithm is the same in pcl's cuda implementation and cuda-pcl.
If so, it is possible that changing the algorithm will not have much effect.
Do you also have cuda-pcl and pcl profiles?
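To make the two options concrete, here is a CPU-only sketch of both grouping strategies. It is an illustration, not cupoch or stdgpu code: `Point`, `voxel_key`, `average_by_sort`, and `average_by_hash` are hypothetical names, and the 21-bit key packing is an assumption. The sort variant mirrors a sort-then-reduce pattern; the hash variant accumulates per voxel without ordering the points.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <utility>
#include <vector>

struct Point { float x, y, z; };
struct Accum { float x = 0.0f, y = 0.0f, z = 0.0f; std::size_t n = 0; };
using Keyed = std::pair<std::uint64_t, Point>;

// Shifts one voxel index into [0, 2^21) so three of them fit in 64 bits.
std::uint64_t axis_index(float coord, float voxel_size)
{
    const std::int64_t offset = std::int64_t{1} << 20;
    const std::int64_t idx =
        static_cast<std::int64_t>(std::floor(coord / voxel_size)) + offset;
    assert(idx >= 0 && idx < (std::int64_t{1} << 21));
    return static_cast<std::uint64_t>(idx);
}

// Hypothetical key: packs the three voxel indices into one integer.
std::uint64_t voxel_key(const Point& p, float voxel_size)
{
    assert(voxel_size > 0.0f);
    return (axis_index(p.x, voxel_size) << 42) |
           (axis_index(p.y, voxel_size) << 21) |
            axis_index(p.z, voxel_size);
}

// Sort-based grouping: sort points by key, then average each run of equal keys.
std::vector<Point> average_by_sort(const std::vector<Point>& points, float voxel_size)
{
    std::vector<Keyed> keyed;
    keyed.reserve(points.size());
    for (const Point& p : points) {
        keyed.emplace_back(voxel_key(p, voxel_size), p);
    }
    std::sort(keyed.begin(), keyed.end(),
              [](const Keyed& a, const Keyed& b) { return a.first < b.first; });

    std::vector<Point> out;
    for (std::size_t i = 0; i < keyed.size();) {
        const std::uint64_t key = keyed[i].first;
        Accum a;
        for (; i < keyed.size() && keyed[i].first == key; ++i) {
            a.x += keyed[i].second.x;
            a.y += keyed[i].second.y;
            a.z += keyed[i].second.z;
            ++a.n;
        }
        const float n = static_cast<float>(a.n);
        out.push_back({a.x / n, a.y / n, a.z / n});
    }
    return out;
}

// Hash-based grouping: accumulate sums per voxel, no sorting involved.
std::vector<Point> average_by_hash(const std::vector<Point>& points, float voxel_size)
{
    std::unordered_map<std::uint64_t, Accum> cells;
    for (const Point& p : points) {
        Accum& a = cells[voxel_key(p, voxel_size)];
        a.x += p.x;
        a.y += p.y;
        a.z += p.z;
        ++a.n;
    }

    std::vector<Point> out;
    out.reserve(cells.size());
    for (const auto& kv : cells) {
        const float n = static_cast<float>(kv.second.n);
        out.push_back({kv.second.x / n, kv.second.y / n, kv.second.z / n});
    }
    return out;
}
```

Both functions return one centroid per occupied voxel; only the output order differs, since the hash map does not order its cells. Whether the hash variant is actually faster on a GPU depends on the hash map implementation and is not shown by this sketch.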

ZhenshengLee · 2021-09-27T07:33:56Z

> which is to implement it using hashmap(stdgpu).

There is a hash-map-based implementation in https://github.com/JanuszBedkowski/gpu_computing_in_robotics, but a Thrust-based hash map would be easier to integrate into cupoch.

> What I am wondering is whether the algorithm is the same in pcl's cuda implementation and cuda-pcl.

The voxel downsample in cuda-pcl is shipped as a prebuilt library rather than as source, so its implementation is unclear. Its performance is a bit better than cupoch's, and its run time is also more consistent.

Actually, I used the CPU version of pcl's voxel downsample, because pcl's CUDA module is unstable due to lack of maintenance.

> If so, it is possible that changing the algorithm will not have much effect.

Yes, you are right, but things get messy when you mix multiple third-party libraries, because additional copies happen. cupoch, on the other hand, can be a modern, unified way to preprocess point cloud data. I'd like to study an existing implementation and rewrite it in Thrust.

> Do you also have cuda-pcl and pcl profiles?

The cuda-pcl executable may not provide enough information in nvvp. If you still need that profile, I will produce it.

There is no need to profile the pcl executable in nvvp, right?

BTW, you can use perception_cuda_pcl as a ROS wrapper for cuda-pcl.

neka-nat · 2021-09-27T10:36:05Z

Thank you for your comment!
The implementation with hashmap seems to work well.
I'll look into the implementation too.

ZhenshengLee · 2021-09-27T16:24:30Z

> I'll look into the implementation too.

I am starting to agree that Thrust is great for its STL-like container and allocator definitions, but the performance of some of its algorithms is suboptimal. Functions such as sort may need to be replaced by raw CUDA or another C++ wrapper.

Edit: I'd like to make an early attempt in this repo: https://github.com/ZhenshengLee/cupoch_contrib

neka-nat · 2022-04-18T09:57:13Z

@ZhenshengLee
Do you know whether the VoxelGridFilter in cuda-pcl is a "Simple VoxelGridFilter"?

A "Simple VoxelGridFilter" does not calculate the average of the points in each grid cell; it uses the center of the voxel as the filter output.
From what I have found, the cuda version of Open3D uses this "Simple VoxelGridFilter". (Figure below)
If cuda-pcl uses the "Simple VoxelGridFilter", that would explain why it is faster.
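For reference, the definition above can be sketched on the CPU as follows. This only illustrates the idea; it is not taken from cuda-pcl or Open3D, and `Point` and `simple_voxel_filter` are hypothetical names. Each occupied voxel emits its geometric center, so no per-voxel sums or counts are needed.

```cpp
#include <array>
#include <cassert>
#include <cmath>
#include <set>
#include <vector>

struct Point { float x, y, z; };

// Hypothetical CPU sketch of a "Simple VoxelGridFilter": every occupied voxel
// contributes its geometric center; the points inside are never averaged.
std::vector<Point> simple_voxel_filter(const std::vector<Point>& points, float voxel_size)
{
    assert(voxel_size > 0.0f);

    // Collect the integer indices of all voxels that contain at least one point.
    std::set<std::array<long, 3>> occupied;
    for (const Point& p : points) {
        const std::array<long, 3> cell = {{
            static_cast<long>(std::floor(p.x / voxel_size)),
            static_cast<long>(std::floor(p.y / voxel_size)),
            static_cast<long>(std::floor(p.z / voxel_size))}};
        occupied.insert(cell);
    }

    // Emit the center of each occupied voxel, in lexicographic index order.
    std::vector<Point> centers;
    centers.reserve(occupied.size());
    for (const std::array<long, 3>& cell : occupied) {
        centers.push_back({(cell[0] + 0.5f) * voxel_size,
                           (cell[1] + 0.5f) * voxel_size,
                           (cell[2] + 0.5f) * voxel_size});
    }
    return centers;
}
```

Compared with an averaging filter, this only needs to deduplicate voxel indices, which is the kind of saving that could make such a filter faster. The output points lie on a regular lattice rather than at the centroids of the original data.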

ZhenshengLee mentioned this issue Apr 20, 2021

[QST]About the best practice of memory copy. #61

Closed

ZhenshengLee closed this as completed Aug 18, 2021

ZhenshengLee reopened this Sep 17, 2021

ZhenshengLee changed the title ~~About performance in ros callback context.~~ [QST] performance in ros callback context. Sep 17, 2021

ZhenshengLee mentioned this issue Sep 17, 2021

[QST] Plan on interoperability with thrust and libcu++. eyalroz/cuda-api-wrappers#259

Closed

neka-nat added the enhancement New feature or request label Sep 27, 2021

ZhenshengLee changed the title ~~[QST] performance in ros callback context.~~ [BUG] performance in ros callback context. Sep 28, 2021

ZhenshengLee closed this as completed Apr 7, 2022
