Fix benchmarks #2628

Merged (6 commits, Feb 13, 2024)

Conversation

@rhaschke (Contributor):

I had a closer look at the benchmarks presented in #2546 today. I noticed that the increasing number of states was not allocated simultaneously but sequentially (thus actually just freeing and re-allocating the same memory blocks).
This PR changes this to true parallel allocation, which mimics the real behavior in a planner more closely.
Together with the additional changes of #2617, this yields a much stronger relative slow-down than what was shown in #2546:

| benchmark | ROS1 | ROS2 | increase |
| --- | --- | --- | --- |
| RobotStateBenchmark/construct/100 | 0.019 ms | 0.020 ms | +5 % |
| RobotStateBenchmark/construct/1000 | 1.20 ms | 1.25 ms | +4.1 % |
| RobotStateBenchmark/construct/10000 | 15.6 ms | 16.0 ms | +2.5 % |
| RobotStateBenchmark/copyConstruct/100 | 0.018 ms | 0.020 ms | +11.1 % |
| RobotStateBenchmark/copyConstruct/1000 | 1.18 ms | 0.197 ms | |
| RobotStateBenchmark/copyConstruct/10000 | 15.2 ms | 16.3 ms | +7.2 % |
| RobotStateBenchmark/update/10 | 0.007 ms | 0.008 ms | +14.2 % |
| RobotStateBenchmark/update/100 | 0.068 ms | 0.086 ms | +26.4 % |
| RobotStateBenchmark/update/1000 | 0.710 ms | 0.885 ms | +24.6 % |
| RobotStateBenchmark/update/10000 | 10.4 ms | 11.7 ms | +12.5 % |

Not only are allocations affected; primarily state updates are, and those occur much more frequently.
There is also a noticeable non-linearity when scaling up the number of parallel states, which is probably due to more frequent cache misses.

Two further notes:
- With the benchmark library, there is no need to specify an iteration count.
- Interestingly, 4x4 matrix multiplication is faster than affine * matrix.

codecov bot commented Dec 21, 2023

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison is base (0e1e376) 50.99% compared to head (1ddea2e) 51.10%.
Report is 24 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2628      +/-   ##
==========================================
+ Coverage   50.99%   51.10%   +0.12%     
==========================================
  Files         387      393       +6     
  Lines       32308    32753     +445     
==========================================
+ Hits        16473    16736     +263     
- Misses      15835    16017     +182     


Actually allocate the given number of states instead of just one.
@marioprats (Contributor) left a comment:

Hey Robert, thanks for making the benchmark more realistic, and cleaning it up, this is a great change.
I don't think it changes much about performance. We reported a slowdown of 4x in #2546, which is worse than what we see here. I understand it's a few more ms in the end, but we think it's a price worth paying for the additional safety. We also think we can recover these losses in many other ways, like your own moveit/moveit#3548, or what we did with Jacobians in #2389, and many other improvements we can make. We should be able to have both performant and bullet-proof core types. But to reach that point we need good benchmarks and tests, so it's great that you made this benchmark better and are already using it to drive improvements. Thank you!

benchmark::DoNotOptimize(result = isometry.affine() * isometry.matrix());
benchmark::ClobberMemory();
}
benchmark::DoNotOptimize(result.affine().noalias() = isometry.affine() * isometry.matrix());
Contributor:

I kept the internal loop you had with MATRIX_OPS_N_ITERATIONS to get more significant differences. Otherwise this is in the order of ~10ns and the noise makes it harder to appreciate the difference. I'm fine either way, feel free to leave as is.

Contributor Author:

In my old code, multiple iterations were required for some statistical significance. However, the Google benchmark library automatically iterates a to-be-timed code snippet as long as needed to reach statistical significance. Hence, this no longer needs to be done manually.

BENCHMARK(multiplyMatrixTimesMatrixNoAlias);
BENCHMARK(multiplyIsometryTimesIsometryNoAlias);
BENCHMARK(multiplyMatrixTimesMatrix);
BENCHMARK(multiplyIsometryTimesIsometry);
Contributor:

I would include back the multiplyAffineTimesMatrix benchmark, for completeness. We would then have the alias and noalias version of the three.
Also, what do you think about interleaving these, so that we see the alias and noalias version next to each other and is easier to compare?

Contributor Author:

I just tested different variants. It's quite obvious that the NoAlias versions are faster.

Contributor:

Good to know about the .noalias(). Unfortunately the default behavior is much more frequent in code, so it would be great to keep both versions of the three multiply benchmarks.


// Robot and planning group for benchmarks.
constexpr char PANDA_TEST_ROBOT[] = "panda";
constexpr char PANDA_TEST_GROUP[] = "panda_arm";
constexpr char PR2_TEST_ROBOT[] = "pr2";
Contributor:

I think it makes sense to use the pr2 for the update() benchmark, because it has way more joints than the panda, and different types of joints (prismatic, revolute, fixed), so it will exercise the update() call much more.

Contributor Author:

Feel free to change it. From my point of view, it doesn't matter whether we call update() more often or have a longer-running update() call.

Contributor:

It also includes different joint types, and a more complex tree structure in general, so I think it makes sense to keep the PR2 test. But I'm happy to change this in a follow-up PR.

affine.matrix() = isometry.matrix();

Eigen::Affine3d affine = createTestIsometry();
Eigen::Affine3d result;
for (auto _ : st)
Contributor:

I wonder if there's any compiler optimization happening here and affecting the results. What do you think?
I'm surprised to see that the inverse of a Matrix4d is the fastest with this PR. Originally, it was the slowest, which matched your original benchmark too.
This is what I get with this PR:

inverseIsometry3d                             16.9 ns         16.9 ns    
inverseAffineIsometry                         18.6 ns         18.6 ns    
inverseAffine                                 31.4 ns         31.4 ns    
inverseMatrix4d                               10.5 ns         10.5 ns    

And this is before this change, and I believe with your original benchmark too:

InverseIsometry3d/10000000                    107 ms          107 ms 
InverseAffineIsometry/10000000               23.3 ms         23.3 ms  
InverseAffine/10000000                       52.7 ms         52.7 ms 
InverseMatrix4d/10000000                      109 ms          109 ms   

Contributor Author:

Yes, I think that there is some compiler optimization going on. However, I didn't manage to fix that using various variants of DoNotOptimize. Finally I had to move on to other stuff...

Contributor:

What about keeping the local variable and the internal loop?
It is very suspicious that the inverse of a Matrix4d is way cheaper than the affine inverse.

Contributor Author:

The internal loop doesn't contribute anything when using Google's benchmark library, which does exactly that: performing the same operation in a loop for good statistics.
I can imagine that the affine inverse is slower because it cannot optimize memory access (it needs to skip the entries of the last, fixed matrix row).

moveit_core/robot_state/test/robot_state_benchmark.cpp (outdated thread, resolved)
for (size_t i = 0; i < num; i++)
result.push_back(i);

std::random_device random_device;
Contributor:

What do you think about using a deterministic random sequence instead?
Not sure if it will change anything, but at least it would create more deterministic conditions.

BENCHMARK(robotStateConstruct)->RangeMultiplier(10)->Range(100, 10000)->Unit(benchmark::kMillisecond);
BENCHMARK(robotStateCopy)->RangeMultiplier(10)->Range(100, 10000)->Unit(benchmark::kMillisecond);
BENCHMARK(robotStateUpdate)->RangeMultiplier(10)->Range(10, 1000)->Unit(benchmark::kMillisecond);
BENCHMARK(robotStateForwardKinematics)->RangeMultiplier(10)->Range(10, 1000)->Unit(benchmark::kMillisecond);
Contributor:

See my other comment. I think it still makes sense to keep this one, but maybe rename to robotStateGetGlobalLinkTransform, since that's what it calls?

Address review comments
@henningkayser added and later removed the discuss label (Jan 23, 2024).
Co-authored-by: Mario Prats <mario.prats@picknik.ai>
Co-authored-by: Mario Prats <mario.prats@picknik.ai>
@rhaschke (Contributor Author):

Is there anything more to do for this to get merged, @marioprats?

@marioprats (Contributor):

I think we can merge this, thanks Robert!

@henningkayser henningkayser merged commit fbad1c2 into moveit:main Feb 13, 2024
12 checks passed
3 participants