
Implemented an in-place method for transforming DataPoints objects #419

Conversation

YoshuaNava
Contributor

@YoshuaNava YoshuaNava commented Nov 19, 2020

I noticed that some computations relying on libpointmatcher were losing time transforming point clouds between sensor and map frames.

I looked into libpointmatcher's Transformations code and, following the same pattern as the DataPointsFilter, implemented an inPlace processing method for computing transformations.

This should reduce the number of copies by two: one for copying the input in the compute() function, and one for copying the return value (though the latter could also be moved).

@YoshuaNava
Contributor Author

YoshuaNava commented Nov 19, 2020

I compared multiple implementations:

1. In-place multiply:
   cloud.features = parameters * cloud.features;
2. Apply on the left:
   cloud.features.applyOnTheLeft(parameters);
3. Transpose multiply:
   cloud.features.transpose() *= parameters.transpose();

In terms of timing, all of them run in sub-millisecond time for all the point clouds I tried (depth camera, LiDAR, SLAM).

Transpose seems to be the slowest method, while in-place multiply and apply-on-the-left tend to be the fastest, although there doesn't seem to be a single winner. (NOTE: after testing again, I found that my testing setup made inefficient use of Eigen operations; transpose was actually the fastest when properly implemented.)

For the final implementation I think I'll go with applyOnTheLeft, because the function signature is very idiomatic and the performance improves.


When it comes to the old vs. new methods, the old method is 1.5-10x slower than the new ones, purely because of the copy.

@pomerlef
Collaborator

Since you are digging into this: you can optimize the inverse of a transformation matrix using the closed-form rigid-transform inverse:

[image: formula for the inverse of a rigid transformation, T⁻¹ = [Rᵀ | −Rᵀt; 0 1]]

In Python, skipping the generic inverse led to some improvement:

With inverse
10.8 µs ± 42.3 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

Without inverse
9.42 µs ± 40 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
Improvement of 12.27%

@YoshuaNava
Contributor Author

YoshuaNava commented Nov 20, 2020

@pomerlef Thank you for the input! Do you have any specific place in mind for this change?

After running a quick search I found the following occurrences:

[screenshot: search results showing the occurrences found]

There aren't many, but only the ones in ICP.cpp seem related to transformations?

@pomerlef
Collaborator

I didn't check in the code what the impact would be (apparently minor). I just came across this formulation for the rigid-transformation inverse recently; it was mostly to log it somewhere.

@YoshuaNava
Contributor Author

@pomerlef I extended the unit tests to directly validate the result of transforming DataPoints objects. I also noticed that the similarity transform was neither registered nor had its own constructor, so I added both.

I quickly ran a benchmark of the unit tests in Intel VTune with the HW sampler, obtaining the following results:

Before:

[VTune screenshot: baseline profile]

After:

[VTune screenshot: profile with in-place transformations]

There is a ~10% improvement in running time and a small bump in microarchitecture usage (possibly because we make fewer copies, and thus spend less time moving data in memory).

[Review comment on utest/utest.h (resolved)]
@pomerlef
Collaborator

pomerlef commented Dec 1, 2020

Looking good! An improvement of 10% is quite good. Thanks for this PR @YoshuaNava.

@pomerlef pomerlef merged commit fba9de4 into norlab-ulaval:master Dec 1, 2020
@YoshuaNava
Contributor Author

YoshuaNava commented Dec 1, 2020

@pomerlef Thank you.

Where I saw the benefit of the change the most is when running localization and mapping with parameters for very dense map construction. There, the time spent transforming point clouds went down from 7% to around 4%.

For live operation, with point clouds and maps downsampled to a density of 5 to ~10 cm, the overall improvement was from around 5% to 4%.


Of all the time spent by the transformation functions, 30% goes to transforming the features (https://github.com/YoshuaNava/libpointmatcher/blob/feature/transformations_inplace_compute/pointmatcher/TransformationsImpl.cpp#L191)
and the rest to transforming the descriptors (https://github.com/YoshuaNava/libpointmatcher/blob/feature/transformations_inplace_compute/pointmatcher/TransformationsImpl.cpp#L202).

I tried to come up with further optimizations, but couldn't find something faster than Eigen block operations.

For future reference, this is the call stack of the inPlaceCompute function of RigidTransformation:
[image: call stack of RigidTransformation::inPlaceCompute]

@YoshuaNava
Contributor Author

YoshuaNava commented Dec 2, 2020

Update: I did find a way to optimize the descriptor assignment further. See snippet here: https://godbolt.org/z/9W547n

Another benchmark for the different methods evaluated before introducing this PR: https://godbolt.org/z/r88h6G

I'll submit another PR tomorrow.
