Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] performance in ros callback context. #62

Closed
ZhenshengLee opened this issue Apr 20, 2021 · 9 comments
Closed

[BUG] performance in ros callback context. #62

ZhenshengLee opened this issue Apr 20, 2021 · 9 comments
Labels
enhancement New feature or request

Comments

@ZhenshengLee
Copy link
Contributor

As #60 did, I gave more tests on performance.

The passfilter function costs 0.096551ms to filter 119978 points to 5510.

image

But this output is in a simple execution context, you can find the code in #60

When the function run in ros callback context, the performance becomes really low, as 0.670853ms

image

Any ideas about the reason?

@ZhenshengLee
Copy link
Contributor Author

the code is in below.

// prj hdrs
#include "cupoch_conversions/cupoch_conversions.h"

// ros hdrs
#include <ros/ros.h>
#include <sensor_msgs/PointCloud2.h>

using namespace std;
using namespace cupoch;

std::string camera_point_topic;
auto cloud = std::make_shared<geometry::PointCloud>();
sensor_msgs::PointCloud2 m_pub_cupoch_pc;

void points_callback(const sensor_msgs::PointCloud2ConstPtr& msg)
{
    auto start = ros::WallTime::now();
    auto end = ros::WallTime::now();
    auto t1 = ros::WallTime::now();
    auto t2 = ros::WallTime::now();

    t1 = ros::WallTime::now();
    cupoch_conversions::rosToCupoch(msg, cloud);
    t2 = ros::WallTime::now();
    ROS_INFO_STREAM("detect_cupoch_thread rosToCupoch time: " << (t2 - t1).toSec() * 1000.0 << "[ms]");

    if (cloud->HasPoints())
    {
        ROS_INFO("m_detect_cupoch_thread - point size is %d .", cloud->points_.size());
    }

    // 发布滤波后点云
    t1 = ros::WallTime::now();
    cloud = cloud->PassThroughFilter(0, -0.5, 0.5);
    t2 = ros::WallTime::now();
    ROS_INFO_STREAM("detect_cupoch_thread passXYZFilter time: " << (t2 - t1).toSec() * 1000.0 << "[ms]");
    ROS_INFO("m_detect_cupoch_thread passXYZFilter size is %d .", cloud->points_.size());

    t1 = ros::WallTime::now();
    cupoch_conversions::cupochToRos(cloud, m_pub_cupoch_pc, "/gpuac/pointcloud_frame");
    m_pub_cupoch_pc.header.stamp = ros::Time::now();
    m_pub_cupoch_pc.header.frame_id = "/gpuac/pointcloud_frame";
    t2 = ros::WallTime::now();
    ROS_INFO_STREAM("detect_cupoch_thread cupochToRos time: " << (t2 - t1).toSec() * 1000.0 << "[ms]");

    end = ros::WallTime::now();
    ROS_INFO_STREAM("detect_cupoch_thread processing_time: " << (end - start).toSec() * 1000.0 << "[ms]");
}

int main(int argc, char** argv)
{
    ros::init(argc, argv, "cupoch_conversions_test_node");
    ros::NodeHandle private_nh("~");

    utility::InitializeAllocator(utility::PoolAllocation, 1000000000);
    utility::SetVerbosityLevel(utility::VerbosityLevel::Debug);
    cudaSetDeviceFlags( cudaDeviceScheduleBlockingSync);

    private_nh.param("camera3d_point_topic", camera_point_topic, std::string("/points_cloud"));

    auto points_sub = private_nh.subscribe(camera_point_topic, 10, points_callback);

    ros::spin();

    return 0;
}

@neka-nat
Copy link
Owner

neka-nat commented May 5, 2021

Hi!
Why don't you try taking a profile?
If you compare it to the #60 code, you may find something.

@ZhenshengLee
Copy link
Contributor Author

@neka-nat I'v found that cpu freq has an impact to memcpy time between device and host.

The cpu freq should be set to the higghest mode.

sudo cpufreq-set -g performance

May closing this issue.

@ZhenshengLee
Copy link
Contributor Author

ZhenshengLee commented Sep 17, 2021

Hi @neka-nat I finally used nvvp to profile ros callback of voxel-downsample in the context of ROS.

You can check a performance test result of downsample a 11w points cloud in https://github.com/ZhenshengLee/ga_points_downsampler/blob/master/README.md

In fact, the performance of downsampler is not better than that of cuda_pcl in https://github.com/NVIDIA-AI-IOT/cuda-pcl
AlThouth memory pool tech of rmm is used, which beyond my expectations.

after setting this tunning cmd , I get profile of cupoch's voxel downsample, the code is in this repo https://github.com/ZhenshengLee/ga_points_downsampler

nvidia-settings -a '[gpu:0]/GPUPowerMizerMode=1'
sudo cpufreq-set -g performance

image

As sort_by_key has no async version of this according to #71 (comment)

Is there any method to improve the performance of this thrust based cuda program?

Thanks.

Edit , the nvvp file is uploaded if you like.
pointdownsampler.nvvp.zip

created by nvvp 11.0.

@ZhenshengLee ZhenshengLee changed the title About performance in ros callback context. [QST] performance in ros callback context. Sep 17, 2021
@neka-nat
Copy link
Owner

Hi,
Thank you for reporting.
The current voxelfilter algorithm is bottlenecked by sort_by_key, but there is a way to avoid using it,
which is to implement it using hashmap(stdgpu).

What I am wondering is whether the algorithm is the same in pcl's cuda implementation and cuda-pcl.
If so, it is possible that changing the algorithm will not have much effect.
Do you also have cuda-pcl and pcl profiles?

@neka-nat neka-nat added the enhancement New feature or request label Sep 27, 2021
@ZhenshengLee
Copy link
Contributor Author

which is to implement it using hashmap(stdgpu).

there is a hashmap based implementation in https://github.com/JanuszBedkowski/gpu_computing_in_robotics , but a thrust based hashmap will be better for intergration into cupoch

What I am wondering is whether the algorithm is the same in pcl's cuda implementation and cuda-pcl.

the voxel downsample of cuda_pcl is provided by lib other than source, so the implemantation is unclear. And the performance of cuda-pcl is a bit better than that of cupoch and also robust in time consuming.

Actually I used the cpu version of pcl's voxel downsample, because its cuda module is unstable due to lack of maintainance.

If so, it is possible that changing the algorithm will not have much effect.

Yes you are right, but things get mixed when you use multiple third-party libs, addtional copy happens. While cupoch canbe a modern, unified way to preprocess pointcloud data. I'd like to check an implementation and rewrite it in thrust.

Do you also have cuda-pcl and pcl profiles?

So, the cuda-pcl executable may not providing enough info in nvvp, If you still need that, I will do it.

There is no need to profile pcl executable in nvvp, right?

BTW, You can use perception_cuda_pcl as a ros wrapper of cuda-pcl.

@neka-nat
Copy link
Owner

Thank you for your comment!
The implementation with hashmap seems to work well.
I'll look into the implementation too.

@ZhenshengLee
Copy link
Contributor Author

ZhenshengLee commented Sep 27, 2021

I'll look into the implementation too.

I start to agree that thrust is great with STL-like container and allocator definitions, but the performance of some algorithms is suboptimal. These functions including sort may be replaced by raw cuda or another cpp wrapper.

Edit: I'd like to get an early try in this repo. https://github.com/ZhenshengLee/cupoch_contrib

@ZhenshengLee ZhenshengLee changed the title [QST] performance in ros callback context. [BUG] performance in ros callback context. Sep 28, 2021
@neka-nat
Copy link
Owner

neka-nat commented Apr 18, 2022

@ZhenshengLee
I was wondering if you know if the VoxelGridFilter in cuda-pcl is a "Simple VoxelGridFilter"?

"Simple VoxelGridFilter" is a method that does not calculate the average of the point cloud in each grid, but uses the center value of the voxel as the filter solution.
From what I have found, the cuda version of Open3D uses this "Simple VoxelGridFilter". (Figure below)
If cuda-pcl is the "Simple VoxelGridFilter", that is why it is faster.

image

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

2 participants