[BUG] performance in ros callback context. #62
The code is below:

```cpp
// prj hdrs
#include "cupoch_conversions/cupoch_conversions.h"
// ros hdrs
#include <ros/ros.h>
#include <sensor_msgs/PointCloud2.h>
// cuda runtime hdr (for cudaSetDeviceFlags)
#include <cuda_runtime.h>

using namespace std;
using namespace cupoch;

std::string camera_point_topic;
auto cloud = std::make_shared<geometry::PointCloud>();
sensor_msgs::PointCloud2 m_pub_cupoch_pc;

void points_callback(const sensor_msgs::PointCloud2ConstPtr& msg)
{
    auto start = ros::WallTime::now();
    auto end = ros::WallTime::now();
    auto t1 = ros::WallTime::now();
    auto t2 = ros::WallTime::now();

    t1 = ros::WallTime::now();
    cupoch_conversions::rosToCupoch(msg, cloud);
    t2 = ros::WallTime::now();
    ROS_INFO_STREAM("detect_cupoch_thread rosToCupoch time: " << (t2 - t1).toSec() * 1000.0 << "[ms]");
    if (cloud->HasPoints())
    {
        ROS_INFO("m_detect_cupoch_thread - point size is %zu .", cloud->points_.size());
    }

    // publish the filtered point cloud
    t1 = ros::WallTime::now();
    cloud = cloud->PassThroughFilter(0, -0.5, 0.5);
    t2 = ros::WallTime::now();
    ROS_INFO_STREAM("detect_cupoch_thread passXYZFilter time: " << (t2 - t1).toSec() * 1000.0 << "[ms]");
    ROS_INFO("m_detect_cupoch_thread passXYZFilter size is %zu .", cloud->points_.size());

    t1 = ros::WallTime::now();
    cupoch_conversions::cupochToRos(cloud, m_pub_cupoch_pc, "/gpuac/pointcloud_frame");
    m_pub_cupoch_pc.header.stamp = ros::Time::now();
    m_pub_cupoch_pc.header.frame_id = "/gpuac/pointcloud_frame";
    t2 = ros::WallTime::now();
    ROS_INFO_STREAM("detect_cupoch_thread cupochToRos time: " << (t2 - t1).toSec() * 1000.0 << "[ms]");

    end = ros::WallTime::now();
    ROS_INFO_STREAM("detect_cupoch_thread processing_time: " << (end - start).toSec() * 1000.0 << "[ms]");
}

int main(int argc, char** argv)
{
    ros::init(argc, argv, "cupoch_conversions_test_node");
    ros::NodeHandle private_nh("~");
    utility::InitializeAllocator(utility::PoolAllocation, 1000000000);
    utility::SetVerbosityLevel(utility::VerbosityLevel::Debug);
    cudaSetDeviceFlags(cudaDeviceScheduleBlockingSync);
    private_nh.param("camera3d_point_topic", camera_point_topic, std::string("/points_cloud"));
    auto points_sub = private_nh.subscribe(camera_point_topic, 10, points_callback);
    ros::spin();
    return 0;
}
```
Hi!
@neka-nat I've found that the CPU frequency has an impact on memcpy time between device and host. The CPU governor should be set to the highest-performance mode:
`sudo cpufreq-set -g performance`
I may close this issue.
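For reference, on a typical Linux system the governor can be inspected via sysfs before and after pinning it (paths are standard on Linux; `cpufreq-set` comes from the `cpufrequtils` package, and on Jetson boards `jetson_clocks` is the usual equivalent):

```shell
# Show the current governor for each core
cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor

# Pin all cores to the performance governor (requires cpufrequtils)
sudo cpufreq-set -g performance
```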
Hi @neka-nat, I finally used nvvp to profile the voxel-downsample ROS callback in the context of ROS. You can check a performance test result of downsampling a 110k-point cloud at https://github.com/ZhenshengLee/ga_points_downsampler/blob/master/README.md. In fact, the performance of the downsampler is not better than that of cuda-pcl (https://github.com/NVIDIA-AI-IOT/cuda-pcl). After setting these tuning commands, I got a profile of cupoch's voxel downsample (the code is in https://github.com/ZhenshengLee/ga_points_downsampler):
`nvidia-settings -a '[gpu:0]/GPUPowerMizerMode=1'`
`sudo cpufreq-set -g performance`
As sort_by_key has no async version, according to #71 (comment), is there any method to improve the performance of this thrust-based CUDA program? Thanks.
Edit: the nvvp file is uploaded if you'd like it, created by nvvp 11.0.
Hi, what I am wondering is whether the algorithm is the same in pcl's CUDA implementation and cuda-pcl.
There is a hashmap-based implementation in https://github.com/JanuszBedkowski/gpu_computing_in_robotics, but a thrust-based hashmap would be better for integration into cupoch.
The voxel downsample of cuda-pcl is provided as a library rather than source, so the implementation is unclear. The performance of cuda-pcl is a bit better than that of cupoch, and it is also more consistent in time consumed. Actually I used the CPU version of pcl's voxel downsample, because its CUDA module is unstable due to lack of maintenance.
Yes, you are right, but things get mixed when you use multiple third-party libs and additional copies happen, while cupoch can be a modern, unified way to preprocess point cloud data. I'd like to check an implementation and rewrite it in thrust.
So the cuda-pcl executable may not provide enough info in nvvp; if you still need that, I will do it. There is no need to profile a pcl executable in nvvp, right? BTW, you can use perception_cuda_pcl as a ROS wrapper of cuda-pcl.
Thank you for your comment!
I am starting to agree: thrust is great for its STL-like container and allocator definitions, but the performance of some algorithms is suboptimal. These functions, including sort, may be replaced by raw CUDA or another C++ wrapper.
Edit: I'd like to give it an early try in this repo: https://github.com/ZhenshengLee/cupoch_contrib
@ZhenshengLee "Simple VoxelGridFilter" is a method that does not calculate the average of the points in each grid cell, but instead uses the center of the voxel as the filtered point.
As in #60, I ran more performance tests.
The PassThroughFilter call costs 0.096551 ms to filter 119978 points down to 5510.
But that number comes from a simple standalone execution context; you can find the code in #60.
When the same function runs in a ROS callback context, the time rises to 0.670853 ms.
Any ideas about the reason?