This repository has been archived by the owner on Jul 26, 2024. It is now read-only.

RGB Pointcloud slows down the publishing rate a lot #56

Open
RoseFlunder opened this issue Aug 22, 2019 · 18 comments
Assignees
Labels
bug (Something isn't working) · help wanted (Extra attention is needed) · triage approved (The Issue has been reviewed and approved for development by the Azure Kinect ROS Driver Team)

Comments

@RoseFlunder
Contributor

Describe the bug
When rgb_point_cloud is enabled, the node's performance degrades significantly. Without the colorized point cloud, the node maintains a steady 30 Hz publishing rate when the Kinect runs at 30 FPS.
With rgb_point_cloud enabled, the publishing rate drops to roughly 10-12 Hz.
When visualizing the point cloud in RViz, the slow publishing rate leads to a laggy view:
RViz itself maintains a high enough frame rate to display the point cloud, but the data arrives too slowly.
When using the k4aviewer, there is no slowdown in the colorized point cloud view.

To Reproduce
Steps to reproduce the behavior:

  1. Start the node via the driver.launch and use fps:=30, depth_mode:=NFOV_UNBINNED, rgb_point_cloud:=false
  2. Run `rostopic hz points2` in another terminal
  3. Restart the node with rgb_point_cloud:=true
  4. Run `rostopic hz points2` again and compare the rate with step 2

Expected behavior
Maintain the 30 FPS rate for publishing the rgb pointcloud on machines that can easily view the colorized point cloud at 30 FPS in the k4aviewer.

Desktop (please complete the following information):

  • OS: Ubuntu 18.04
@RoseFlunder RoseFlunder added bug Something isn't working triage needed The Issue still needs to be reviewed by the Azure Kinect ROS Driver Team labels Aug 22, 2019
@skalldri
Contributor

Confirmed this is occurring on my ROS Melodic, Ubuntu 18.04 machine. I suspect the main image processing loop is running too slowly on the CPU, since the K4A Viewer does the RGB point cloud math on the GPU.

I'll investigate ways I could do this faster, potentially using SIMD instructions.

@skalldri
Contributor

Confirmed: the image processing loop takes between 40 and 70 ms when the RGB point cloud is enabled. I'll look for ways to potentially speed it up.

@skalldri skalldri added help wanted Extra attention is needed triage approved The Issue has been reviewed and approved for development by the Azure Kinect ROS Driver Team and removed triage needed The Issue still needs to be reviewed by the Azure Kinect ROS Driver Team labels Aug 26, 2019
@RoseFlunder
Contributor Author

RoseFlunder commented Aug 26, 2019

@skalldri
Is a GPU-accelerated version planned?
usage.md states the following, but as far as I can tell none of the SDK methods currently used involve any GPU acceleration:

> Using the point cloud functions in this node will provide GPU-accelerated results, but the same quality point clouds can be produced using standard ROS tools.

Today I saw this example in the sensor sdk repository:
Fast Pointcloud Example
Would this version be faster even on the CPU, or is it comparable? I assume this is the algorithm the k4aviewer implements on the GPU?

Tomorrow I am planning to use the CPU version of this algorithm in the node to check its performance compared to the current version.
If I get it working I will create a branch in my fork for it.

@skalldri
Contributor

The fast point-cloud example does not produce RGB point clouds, just regular point clouds. It's just a way to produce point clouds quickly without using SSE3.

My documentation in usage.md is incorrect: we use SSE3-accelerated point cloud math, not GPU-accelerated. See:

https://github.com/microsoft/Azure-Kinect-Sensor-SDK/blob/c0800a43fb42200935aa9a8da58b8bd5db3d55c9/src/transformation/rgbz.c#L1210

Yes: K4A Viewer does this all on the GPU. Since K4A viewer is going to be drawing everything to the screen as the final destination (rather than writing it to a ROS message), it makes sense for it to do as much work as possible in a shader. They wrote a shader that accepts a depth image and color image as input, as well as the X-Y tables, and then naively draws the RGB point cloud. That's a very fast operation so it runs super quickly.

Copying the point cloud and RGB frame to the GPU, doing some work, then copying it back off the GPU is going to take time. I'd like to take a stab at improving the speed of the CPU-based version before resorting to a GPU implementation.
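For reference, the XY-table approach described above can be sketched on the CPU. This is a minimal illustration of the technique, not the SDK's actual API; the struct and function names here are hypothetical:

```cpp
#include <cstdint>

// Hypothetical sketch of the XY-table technique: a precomputed table stores,
// for each depth pixel, the normalized ray direction (x/z, y/z). Turning a
// depth sample into a 3D point is then one multiply per axis.
struct XyEntry {
  float x;  // precomputed x/z for this pixel
  float y;  // precomputed y/z for this pixel
};

// depth_mm: raw depth in millimetres; outputs the point in metres.
inline void depthToPoint(uint16_t depth_mm, const XyEntry& ray,
                         float& px, float& py, float& pz) {
  const float z_m = depth_mm / 1000.0f;  // mm -> m
  px = ray.x * z_m;
  py = ray.y * z_m;
  pz = z_m;
}
```

A shader can do exactly this per pixel, which is why the k4aviewer's GPU path is so cheap.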

@RoseFlunder
Contributor Author

Thanks for the detailed answer :) I'm looking forward to it, because that's currently one of the main blockers in our project.

@RoseFlunder
Contributor Author

RoseFlunder commented Sep 12, 2019

@skalldri
As I already mentioned in another issue, I did some timing measurements.
In this gist I pasted the modified part of the RGB point cloud method together with the output.
I don't know how accurate ROS Time is, but we can clearly see that the SDK's methods are fine performance-wise.
What's really dragging things down is the loop that builds the message:
https://gist.github.com/RoseFlunder/0cca13211022049ae7abc98ea600a7de

EDIT:
Building the package in release mode drastically speeds this loop up.
Use: `catkin_make -DCMAKE_BUILD_TYPE=Release`
The loop is about 3 times faster with this (~14 ms instead of ~45 ms in my example).
I added the output for this build to the gist as well.

EDIT2:
Just ran another test to see what happens when local variables like BgraPixel and depth_point_3d are replaced with expressions reading directly from the buffers, folding the divide-by-1000 into the expression as well.
That saves about 1 ms on average, so keeping the local variables for readability isn't a big deal.
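The direct-buffer version of that loop might look roughly like this. The buffer layouts (int16 XYZ in millimetres, BGRA colour) follow the SDK's image formats, but the function and its signature are illustrative, not the driver's actual code:

```cpp
#include <cstddef>
#include <cstdint>

// Illustrative inner loop: read straight from the SDK's int16 XYZ buffer
// (millimetres) and BGRA colour buffer, folding the mm -> m conversion
// into the expressions instead of going through named temporaries.
void fillXyzRgb(const int16_t* xyz_mm, const uint8_t* bgra, size_t num_points,
                float* out_xyz, uint8_t* out_rgb) {
  for (size_t i = 0; i < num_points; ++i) {
    out_xyz[3 * i + 0] = xyz_mm[3 * i + 0] / 1000.0f;
    out_xyz[3 * i + 1] = xyz_mm[3 * i + 1] / 1000.0f;
    out_xyz[3 * i + 2] = xyz_mm[3 * i + 2] / 1000.0f;
    // BGRA in, RGB out
    out_rgb[3 * i + 0] = bgra[4 * i + 2];
    out_rgb[3 * i + 1] = bgra[4 * i + 1];
    out_rgb[3 * i + 2] = bgra[4 * i + 0];
  }
}
```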

@skalldri
Contributor

Thanks for digging into this issue RoseFlunder. I suspect the iterator used to fill up the PointCloud2 message is very expensive. There is almost certainly a faster way to copy the data from the SDK structures into the PointCloud2 message.

Unfortunately, I'm on vacation until the 25th so I won't be able to make much progress on this until I get back.

@skalldri skalldri self-assigned this Oct 10, 2019
@skalldri skalldri removed the help wanted Extra attention is needed label Oct 10, 2019
@skalldri
Contributor

Going to start having a look at this.

@skalldri
Contributor

skalldri commented Oct 22, 2019

EDIT:

Never mind, the original code has a simple mistake: not calling pcl_modifier.resize() before trying to modify the point cloud! Using the sped-up branch helps, but still doesn't get us over the finish line. Digging further into Valgrind on the latest code.

ORIGINAL:

Well, Valgrind seems to think that using PointCloud2Iterator is very expensive:

[Valgrind screenshot: PointCloud2Iterator hotspots]

Combined, PointCloud2Iterator::operator++ and PointCloud2Iterator::operator* account for ~20% of the time spent in getRgbPointCloud()! While not confirmed, this makes me think that we could see significant speedups if we do this the old fashioned way.

I'll see if I can build a version of the point cloud functions that doesn't use the iterator.
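An iterator-free fill could be sketched as below. The field offsets and point_step are assumptions for a typical xyz+rgb PointCloud2 layout; a real implementation should take them from the message's field descriptions:

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Sketch of "the old fashioned way": write each point directly into a
// PointCloud2-style byte buffer using fixed field offsets and point_step,
// instead of advancing one PointCloud2Iterator per field. Assumed layout:
// float32 x/y/z at offsets 0/4/8, packed rgb at offset 16, point_step 32.
void writePoint(uint8_t* data, size_t index, size_t point_step,
                float x, float y, float z,
                uint8_t r, uint8_t g, uint8_t b) {
  uint8_t* p = data + index * point_step;
  std::memcpy(p + 0, &x, sizeof(float));
  std::memcpy(p + 4, &y, sizeof(float));
  std::memcpy(p + 8, &z, sizeof(float));
  p[16] = b;  // the packed rgb float stores bytes in b, g, r order
  p[17] = g;
  p[18] = r;
}
```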

@skalldri
Contributor

It's good to see that this is using memset_avx2_erms: that's probably the compiler optimizing the PointCloud2 alloc.

@emmanuel-senft

Hi, I am also affected by this bug, and I was wondering whether there has been any progress on it?

@ooeygui ooeygui added the help wanted Extra attention is needed label Jan 10, 2020
cthorey pushed a commit to Xihelm/Azure_Kinect_ROS_Driver that referenced this issue Mar 27, 2020
Using fillColorPointCloud is too slow. There is an open bug
microsoft#56
about it.

Instead, this publishes only the xyz values in the camera frame, which
is what we need. The number of points equals h*w of the raw_image,
so the cloud can easily be recombined later if needed.
@anastasiabolotnikova

Hello, glad to know that this issue is getting attention and already started to be investigated. Thanks @RoseFlunder @skalldri for sharing your findings!

This is indeed a limitation of the current driver implementation that should be addressed to speed up the loop cycle. Both fillColorPointCloud and fillPointCloud could be much faster if going over the entire point cloud sequentially point by point is somehow avoided.

Three questions that I would like to ask regarding this issue:

  1. Is there no way of converting the 3D point coordinate data from the K4A SDK structure (e.g. a buffer) into the sensor_msgs::PointCloud2.data binary blob using memcpy? I suppose that would be the fastest way to deal with it on machines without a GPU?

  2. Why is a new PointCloud2 created and resized on every cycle? I suppose it only needs to be created and sized correctly once, before the while loop of framePublisherThread?

  3. A parallel implementation of fillColorPointCloud and fillPointCloud for machines with GPU would be great! Especially if there is no way to copy points from SDK structure into ROS message using memcpy. Could you share with us any information regarding the plans of this implementation?

Thank you in advance for the reply!
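The preallocation idea in question 2 can be sketched like this; the type and sizes are illustrative (NFOV_UNBINNED depth is 640x576), not the driver's actual code:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Sketch of preallocation: since a given depth mode has a fixed resolution,
// the message buffer can be sized once before the publish loop instead of
// being recreated and resized every frame.
struct CloudBuffer {
  size_t width = 0, height = 0, point_step = 0;
  std::vector<uint8_t> data;

  void allocateOnce(size_t w, size_t h, size_t step) {
    width = w;
    height = h;
    point_step = step;
    data.resize(w * h * step);  // single allocation, reused every frame
  }
};
```

In the real driver the sensor_msgs::PointCloud2 itself would be the member that is kept across iterations, so only its data bytes are rewritten per frame.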

@ooeygui
Member

ooeygui commented Apr 13, 2020

Hi All,
We have a few Azure Kinect performance-related issues on our backlog, as well as migrating to ROS2, but we likely won't be able to start working on them until the summer.

@YoshuaNava

YoshuaNava commented Oct 19, 2020

@skalldri I'm also having issues with the PointCloud2Iterator, although I'm not a user of the Azure Kinect. Do you still plan to optimize it? I would be happy to collaborate to find better ways to use it, or to come up with improvements.

@ooeygui
Member

ooeygui commented Oct 19, 2020

Hi @YoshuaNava ,
This work is still on the backlog.

The code currently does this work linearly in C++, in:

k4a_result_t K4AROSDevice::fillColorPointCloud(const k4a::image& pointcloud_image, const k4a::image& color_image,

We are looking at CUDA for this codepath when available.

@YoshuaNava

@ooeygui Thank you for your prompt answer.

I see. I have continued using the iterator. So far it has let me speed up an old raw deserialization method by 3-5x, which is a good gain for relatively low effort.

I'll share any insights found while I work on this.

@ibrachahine

Hi, I am prototyping an application that requires the depth topic on ROS, and it seems to be very slow, at 1 Hz. Are there any recent updates on this issue?

Thanks

@ooeygui
Member

ooeygui commented Mar 11, 2022

@ibrachahine Thank you for your interest and for asking here. We have not invested in fixing this bug, but would accept a pull request. Have you tried working with parameters on the node to optimize for your hardware and scenario?


7 participants