Fix benchmarks #2628

Merged (6 commits, Feb 13, 2024)

Conversation

@rhaschke (Contributor):

I had a closer look at the benchmarks presented in #2546 today. I noticed that the increasing number of states was not allocated simultaneously but sequentially (thus actually just freeing and re-allocating the same memory blocks).
This PR changes this to true parallel allocation, which mimics the real behavior in a planner more closely.
Together with the additional changes of #2617, this yields a much stronger relative slow-down than what was shown in #2546:

| benchmark | ROS1 | ROS2 | increase |
| --- | --- | --- | --- |
| RobotStateBenchmark/construct/100 | 0.019 ms | 0.020 ms | +5 % |
| RobotStateBenchmark/construct/1000 | 1.20 ms | 1.25 ms | +4.1 % |
| RobotStateBenchmark/construct/10000 | 15.6 ms | 16.0 ms | +2.5 % |
| RobotStateBenchmark/copyConstruct/100 | 0.018 ms | 0.020 ms | +11.1 % |
| RobotStateBenchmark/copyConstruct/1000 | 1.18 ms | 0.197 ms | |
| RobotStateBenchmark/copyConstruct/10000 | 15.2 ms | 16.3 ms | +7.2 % |
| RobotStateBenchmark/update/10 | 0.007 ms | 0.008 ms | +14.2 % |
| RobotStateBenchmark/update/100 | 0.068 ms | 0.086 ms | +26.4 % |
| RobotStateBenchmark/update/1000 | 0.710 ms | 0.885 ms | +24.6 % |
| RobotStateBenchmark/update/10000 | 10.4 ms | 11.7 ms | +12.5 % |

Not only are allocations affected; primarily state updates are, and those occur much more frequently.
There is also a noticeable non-linearity when scaling up the number of parallel states, which is probably due to more frequent cache misses.

Two further notes:
- With the benchmark library, there is no need to specify an iteration count.
- Interestingly, 4x4 matrix multiplication is faster than affine * matrix.

codecov bot commented Dec 21, 2023

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison is base (0e1e376) 50.99% compared to head (1ddea2e) 51.10%.
Report is 24 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2628      +/-   ##
==========================================
+ Coverage   50.99%   51.10%   +0.12%     
==========================================
  Files         387      393       +6     
  Lines       32308    32753     +445     
==========================================
+ Hits        16473    16736     +263     
- Misses      15835    16017     +182     


Actually allocate the given number of states instead of just one.
@marioprats (Contributor) left a comment:

Hey Robert, thanks for making the benchmark more realistic, and cleaning it up, this is a great change.
I don't think it changes much about performance. We reported a slowdown of 4x in #2546, which is worse than what we see here. I understand it's a few more ms in the end, but we think it's a price worth paying for the additional safety. We also think we can recover these losses in many other ways, like your own moveit/moveit#3548, or what we did with Jacobians in #2389, and many other improvements we can make. We should be able to have both performant and bullet-proof core types. But to reach that point we need good benchmarks and tests, so it's great that you made this benchmark better and are already using it to drive improvements. Thank you!

benchmark::DoNotOptimize(result = isometry.affine() * isometry.matrix());
benchmark::ClobberMemory();
}
benchmark::DoNotOptimize(result.affine().noalias() = isometry.affine() * isometry.matrix());
Contributor:

I kept the internal loop you had with MATRIX_OPS_N_ITERATIONS to get more significant differences. Otherwise this is in the order of ~10ns and the noise makes it harder to appreciate the difference. I'm fine either way, feel free to leave as is.

Contributor Author:

In my old code, multiple iterations were required for some statistical significance. However, the Google benchmark library automatically iterates a to-be-timed code snippet as long as needed to reach statistical significance. Hence, this no longer needs to be done manually.

BENCHMARK(multiplyMatrixTimesMatrixNoAlias);
BENCHMARK(multiplyIsometryTimesIsometryNoAlias);
BENCHMARK(multiplyMatrixTimesMatrix);
BENCHMARK(multiplyIsometryTimesIsometry);
Contributor:

I would include back the multiplyAffineTimesMatrix benchmark, for completeness. We would then have the alias and noalias version of the three.
Also, what do you think about interleaving these, so that we see the alias and noalias version next to each other and is easier to compare?

Contributor Author:

I just tested different variants. It's quite obvious that the NoAlias versions are faster.

Contributor:

Good to know about the .noalias(). Unfortunately the default behavior is much more frequent in code, so it would be great to keep both versions of the three multiply benchmarks.


// Robot and planning group for benchmarks.
constexpr char PANDA_TEST_ROBOT[] = "panda";
constexpr char PANDA_TEST_GROUP[] = "panda_arm";
constexpr char PR2_TEST_ROBOT[] = "pr2";
Contributor:

I think it makes sense to use the pr2 for the update() benchmark, because it has way more joints than the panda, and different types of joints (prismatic, revolute, fixed), so it will exercise the update() call much more.

Contributor Author:

Feel free to change it. From my point of view, it doesn't matter whether we call update() more often or have a longer-running update() call.

Contributor:

It also includes different joint types, and a more complex tree structure in general, so I think it makes sense to keep the PR2 test. But I'm happy to change this in a follow-up PR.

affine.matrix() = isometry.matrix();

Eigen::Affine3d affine = createTestIsometry();
Eigen::Affine3d result;
for (auto _ : st)
Contributor:

I wonder if there's any compiler optimization happening here and affecting the results. What do you think?
I'm surprised to see that the inverse of a Matrix4d is the fastest with this PR. Originally, it was the slowest, which matched your original benchmark too.
This is what I get with this PR:

inverseIsometry3d                             16.9 ns         16.9 ns    
inverseAffineIsometry                         18.6 ns         18.6 ns    
inverseAffine                                 31.4 ns         31.4 ns    
inverseMatrix4d                               10.5 ns         10.5 ns    

And this is before this change, and I believe with your original benchmark too:

InverseIsometry3d/10000000                    107 ms          107 ms 
InverseAffineIsometry/10000000               23.3 ms         23.3 ms  
InverseAffine/10000000                       52.7 ms         52.7 ms 
InverseMatrix4d/10000000                      109 ms          109 ms   

Contributor Author:

Yes, I think that there is some compiler optimization going on. However, I didn't manage to fix that using various variants of DoNotOptimize. Finally I had to move on to other stuff...

Contributor:

What about keeping the local variable and the internal loop?
It is very suspicious that the inverse of a Matrix4d is way cheaper than the affine inverse.

Contributor Author:

The internal loop doesn't contribute anything when using Google's benchmark library, which does exactly that: performing the same operation in a loop for good statistics.
I can imagine that the affine inverse is slower because it cannot optimize memory access (it needs to skip the entries of the last, fixed matrix row).

moveit_core/robot_state/test/robot_state_benchmark.cpp (outdated thread, resolved)
for (size_t i = 0; i < num; i++)
result.push_back(i);

std::random_device random_device;
Contributor:

What do you think about using a deterministic random sequence instead?
Not sure if it will change anything, but at least it would create more deterministic conditions.

BENCHMARK(robotStateConstruct)->RangeMultiplier(10)->Range(100, 10000)->Unit(benchmark::kMillisecond);
BENCHMARK(robotStateCopy)->RangeMultiplier(10)->Range(100, 10000)->Unit(benchmark::kMillisecond);
BENCHMARK(robotStateUpdate)->RangeMultiplier(10)->Range(10, 1000)->Unit(benchmark::kMillisecond);
BENCHMARK(robotStateForwardKinematics)->RangeMultiplier(10)->Range(10, 1000)->Unit(benchmark::kMillisecond);
Contributor:

See my other comment. I think it still makes sense to keep this one, but maybe rename to robotStateGetGlobalLinkTransform, since that's what it calls?

Address review comments
@henningkayser added and later removed the discuss label (Jan 23, 2024).
Co-authored-by: Mario Prats <mario.prats@picknik.ai>
Co-authored-by: Mario Prats <mario.prats@picknik.ai>
@rhaschke (Contributor Author):

Is there anything more to do for this to get merged, @marioprats?

@marioprats (Contributor):

I think we can merge this, thanks Robert!

@henningkayser henningkayser merged commit fbad1c2 into moveit:main Feb 13, 2024
12 checks passed
3 participants