
SIMD/MIMD enhancements discussion #26

Closed

louis-langholtz opened this issue Jul 30, 2017 · 15 comments

Labels: Enhancement (for suggestions or changes that enhance any part of the project and aren't bugs); Help Wanted (for things that other people are encouraged to help with)

Comments

@louis-langholtz (Owner)

Let's use this issue for any conversation regarding SIMD/MIMD enhancements.

louis-langholtz added the Enhancement and Help Wanted labels Jul 30, 2017
louis-langholtz added this to In Progress in SIMD/MIMD enhancements Jul 30, 2017
@louis-langholtz (Owner, Author)

Update:

Results have been disappointing so far in terms of seeing any speedups.

I believe this is in part because:

  • the simulations I've run are relatively small (perhaps too small to see any noticeable benefit); and
  • up until recently, PlayRho didn't have any real benchmark-oriented testing.

Now PlayRho has the Benchmark application, which uses Google's C++ benchmarking library and gives the project a better basis for any SIMD/MIMD work.
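
As a minimal sketch of what a micro-benchmark in that library looks like (illustrative only, not code from the actual Benchmark application; the FloatDiv name just mirrors the kind of results posted later in this thread):

```cpp
#include <benchmark/benchmark.h>

// Time a single-precision division per iteration.
static void FloatDiv(benchmark::State& state)
{
    float x = 22.0f;
    for (auto _ : state) {
        // DoNotOptimize keeps the compiler from deleting the division.
        benchmark::DoNotOptimize(x /= 1.0001f);
    }
}
BENCHMARK(FloatDiv);

BENCHMARK_MAIN();
```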

@louis-langholtz (Owner, Author)

Something I think is important to keep in mind regarding SIMD/MIMD enhancements is the cache locality, or locality of reference, problem. Someone could spend weeks writing the most thoroughly optimized OpenCL/CUDA code only to see that code's performance blown out of the water by local CPU code that was much better at using memory in cache-friendly ways.

I've been looking into how memory is used and have found things in this regard that were surprising and very interesting to me. Conceptually speaking, getting data to and from external compute resources like the GPU seems to also suffer from a version of the cache locality problem.

Some of the changes I've made to PlayRho have been data-oriented in nature, aimed at improving simulation speed through better memory access patterns and cache locality.
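
For illustration, here's the kind of structure-of-arrays layout that data-oriented design favors (a simplified sketch, not PlayRho's actual layout):

```cpp
#include <cstddef>
#include <vector>

// Structure-of-arrays: each field a loop touches is stored contiguously,
// so iterating over positions and velocities streams through the cache
// without dragging along unrelated per-body data.
struct BodiesSoA {
    std::vector<float> x, y;   // positions
    std::vector<float> vx, vy; // velocities
};

void integratePositions(BodiesSoA& b, float dt)
{
    for (std::size_t i = 0; i < b.x.size(); ++i) {
        b.x[i] += b.vx[i] * dt;
        b.y[i] += b.vy[i] * dt;
    }
}
```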

This all gets hardware-specific alarmingly fast. For example, would it make more sense to write code explicitly using Streaming SIMD Extensions, or to spend that time working on improving the optimizations available in the GNU C++ and/or Clang compilers?

Anyways... stuff I've been thinking about related to this.

@Alan-FGR

Alan-FGR commented Aug 1, 2017

It's funny how Chipmunk not only scales quite considerably when multi-threading is enabled (using the hasty space), but also benefits from hardware acceleration on mobile platforms. According to @Genbox, though, the improvements in Box2D were marginal at best or even negative, which is really puzzling; but then again the two engines are quite different overall.

@Genbox

Genbox commented Aug 1, 2017

@louis-langholtz I also found that branch prediction misses were a big hit in the solver code. I'm sure some clever arithmetic could produce a less branchy version, or even make some of it completely branchless, which would optimize the CPU prefetching.
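
For illustration, a toy example of the kind of branch elimination meant here (not actual solver code): clamping an accumulated impulse with min/max, which compilers typically lower to branchless instructions such as x86's minss/maxss.

```cpp
#include <algorithm>

// Branchy version: two conditionals the predictor can get wrong.
float clampBranchy(float impulse, float maxImpulse)
{
    if (impulse > maxImpulse) return maxImpulse;
    if (impulse < -maxImpulse) return -maxImpulse;
    return impulse;
}

// Branchless version: same result, no conditional jumps in practice.
float clampBranchless(float impulse, float maxImpulse)
{
    return std::min(std::max(impulse, -maxImpulse), maxImpulse);
}
```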

I think the solver code is the best candidate for SIMD since it gets large blocks of data and uses matrix-heavy operations, which can be done in SIMD fashion. The trick is to do as many operations as possible to fill out the SIMD register, which can be up to 512 bits with AVX-512. However, you are right that it makes the code very hardware-specific and very difficult to port, so I would only do it for the hot paths in the solver or maybe the narrow phase.
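
As a minimal taste of what filling a register looks like at the lowest level (x86-specific SSE intrinsics, illustrative only; AVX-512 widens the same idea from four floats per register to sixteen):

```cpp
#include <xmmintrin.h> // SSE intrinsics

// Add four pairs of floats using a single SIMD instruction each for
// load, add, and store.
void add4(const float* a, const float* b, float* out)
{
    const __m128 va = _mm_loadu_ps(a); // load 4 floats (unaligned)
    const __m128 vb = _mm_loadu_ps(b);
    _mm_storeu_ps(out, _mm_add_ps(va, vb));
}
```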

@Genbox

Genbox commented Aug 1, 2017

Note: Intel VTune is amazing for analyzing low-level operations on the CPU.

@louis-langholtz (Owner, Author)

@Genbox Any chance you can take a screen capture or something similar of the branch prediction miss info? Is that from running Intel VTune on PlayRho or on Erin's Box2D code or some other solver code? I'd love to review that info since I've tested some branch eliminations in what I consider solver code and saw no reliably repeatable speedups. I've also had some potential compile-time optimizations in mind that would eliminate some branching so any branch miss info I could get/see I think would be very helpful. Incidentally, I'd love to get my hands on VTune even for the 30-day trial period but I have some hardware resource limits that would make using it harder.

OTOH, there are sin/cos calculations going on inside the world step code path which I've been considering replacing with unit-vector math (using the UnitVec2 type). std::sin and std::cos ops are significantly slower than the std::sqrt operation, which was a surprise for me to see.

In any case, I'd like to expand the benchmarks, especially to fill in the gap between the lowest-level code (like floating-point division times) and the highest-level code (like times for whole simulations coming to rest).

@Genbox

Genbox commented Aug 2, 2017

I used it on Velcro with the 30 day evaluation. It does not care about the programming language since it focuses on the instructions on the CPU and not much else. What kind of hardware limits are we talking about?

Replace sin/cos where? In the b2Rot class? Sin and cos should be implemented as instructions by the compiler. All CPUs should have the FSIN and FCOS instructions, which nowadays are implemented like a lookup table with extrapolation for extra precision, if I remember correctly. That means they are not going to get any faster, so if you replace them (also see this) with sqrt, it would gain a speedup.

On the topic of sin/cos, I did this little thingy some years ago. The CPU instructions should be much faster than those nowadays.

@louis-langholtz (Owner, Author)

louis-langholtz commented Aug 2, 2017

@Genbox Replace sin/cos calculations being done inside the world step. More specifically inside the regular phase and the TOI phase processing. You can see these calculations occurring for every body that's been islanded as part of the broad-phase synchronization.

I'm not sure that the calculations can be replaced with UnitVec2 math (b2Rot math in Erin's Box2D). But if they can be, I think the extra four bytes of data size would be worth it if it results in a net speed-up of the world step method (which I suspect it would). Another cost (besides having to store a total of 8 bytes instead of just 4) would be that orientation information would no longer be directly accessible to the user as an angle. I think that'd be a fair trade-off though, as users really don't need to know how many times a body has turned, only what orientation it's at.

This would also solve the problem of normalizing the angle.
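
As a sketch of the idea (using a hypothetical stand-in type, not the actual UnitVec2/b2Rot code): composing rotations stored as (cos θ, sin θ) unit vectors takes only multiplies and adds, and the occasional renormalization costs a sqrt rather than a sin and a cos.

```cpp
#include <cmath>

// Hypothetical stand-in for UnitVec2/b2Rot: orientation as a unit vector.
struct Rot { float c, s; }; // (cos θ, sin θ)

// Compose two rotations, like multiplying complex numbers (c + i·s).
inline Rot Mul(Rot a, Rot b)
{
    return Rot{a.c * b.c - a.s * b.s, a.s * b.c + a.c * b.s};
}

// Renormalize to counter floating-point drift: a sqrt, not a sin/cos.
inline Rot Normalize(Rot r)
{
    const auto len = std::sqrt(r.c * r.c + r.s * r.s);
    return (len > 0.0f) ? Rot{r.c / len, r.s / len} : Rot{1.0f, 0.0f};
}
```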

@louis-langholtz (Owner, Author)

In case no one has linked it already... https://github.com/wolfviking0/webcl-box2d is an OpenCL accelerated implementation of Box2D. No idea of how much that implementation helps but maybe it can provide helpful insight from a conceptual/API level.

@louis-langholtz (Owner, Author)

Benchmark                               Time           CPU  Iterations
AsyncFutureDeferred                      274 ns        274 ns    2564920
AsyncFutureAsync                       21610 ns      19293 ns      36708
ThreadCreateAndDestroy                 24111 ns      18605 ns      37631
MultiThreadQD                          13942 ns       7346 ns      98785

These are timing results from the Benchmark application for using various mechanisms from the C++ Thread Support Library.

By comparison, here are the results I get for 1000 single-precision divisions and 1000 double-precision std::atan2 calls (using random data generated outside of the timing):

Benchmark                               Time           CPU  Iterations
FloatDiv/1000                           1986 ns       1986 ns     343762
DoubleAtan2/1000                       25093 ns      25088 ns      27938

What I make of these data points, based on the code they're testing, is that the time cost of using any blocking operation from the thread support library is in the ballpark of doing around 7,000 single-precision floating-point division operations, or some 500 std::atan2(double, double) operations. This is on a box running macOS.
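
For context, the kinds of calls being timed above are roughly these (assumed from the benchmark names; not the actual benchmark code):

```cpp
#include <future>

int work() { return 42; } // stand-in for some computation

void compare()
{
    // AsyncFutureDeferred: runs work() lazily on the calling thread at get().
    auto deferred = std::async(std::launch::deferred, work);
    // AsyncFutureAsync: spawns a new thread to run work().
    auto threaded = std::async(std::launch::async, work);
    deferred.get();
    threaded.get();
}
```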

Now that I have the CMake build working as well as it does for Windows, I'll try getting the Benchmark application building and running on Windows to get numbers for that platform.

@ninnghazad (Contributor)

Just dropping this into the discussion: https://github.com/QuantStack/xsimd. It's an abstraction wrapper for SIMD operations across the common instruction sets. I came across it at work and found it quite useful. I also tried other wrappers, but I found this one to have the nicest API. With all wrappers, as well as of course when using SIMD instructions directly (which I would advise against), it isn't just a matter of dropping in the headers and expecting it to just work. However, using SSE* or, even better, AVX through this wrapper offers some serious speedups. To what extent these are applicable to PlayRho's calculations I do not know; it might well be only worth it for simulations with a larger number of objects.
Basically, stored floats/doubles would have to be aligned (the wrapper contains a simple alignment allocator for the STL), and then single operations can be rewritten to utilise SIMD registers by loading from the aligned storage and using the provided datatypes' operators. So it is not necessary to completely rewrite the basic layout of a program to leverage these capabilities.
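
A rough sketch of that usage pattern, along the lines of xsimd's own README (exact API details vary between versions):

```cpp
#include <cstddef>
#include <vector>
#include <xsimd/xsimd.hpp>

// Aligned storage so the batches can use aligned loads/stores.
using vector_type =
    std::vector<float, xsimd::aligned_allocator<float, XSIMD_DEFAULT_ALIGNMENT>>;

// Element-wise a + b -> res, one register's worth of floats at a time.
void add(const vector_type& a, const vector_type& b, vector_type& res)
{
    constexpr std::size_t simd_size = xsimd::simd_type<float>::size;
    const std::size_t size = a.size();
    const std::size_t vec_size = size - (size % simd_size);
    for (std::size_t i = 0; i < vec_size; i += simd_size) {
        const auto ba = xsimd::load_aligned(&a[i]);
        const auto bb = xsimd::load_aligned(&b[i]);
        (ba + bb).store_aligned(&res[i]);
    }
    for (std::size_t i = vec_size; i < size; ++i) { // scalar remainder
        res[i] = a[i] + b[i];
    }
}
```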

@Alan-FGR

Alan-FGR commented Jan 20, 2018

Interesting link.
Yeah, when it comes to SIMD it's not just a matter of replacing ops; you have to mind your data layout and submit it appropriately. Another point is mobile support, I guess. I'm not sure whether it would be practical (or possible) to use a library that supports (abstracts) multiple architectures when it comes to hardware acceleration, and I think mobile is where that's really going to make a huge difference. But then again, I'm personally not interested in mobile and I'm not sure whether that's a priority for @louis-langholtz.

@ninnghazad (Contributor)

Xsimd does offer ARM NEON support, I think, but I too am not interested in mobile. I use it on Xeon CPUs and it does make a big difference where applicable. Where applicable is key here. I don't know if it is the right path for PlayRho, but should SIMD be integrated, I don't think using any instruction set without abstraction would be a good idea.

@moonheart08

Just going to note, because it hasn't been mentioned, that it's possible to pull a line of data into the cache ahead of use with prefetch instructions.
GCC/Clang even have a builtin for it: __builtin_prefetch.
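
For example (a hypothetical loop with made-up Body/integrate names, not PlayRho code):

```cpp
#include <cstddef>

struct Body { float x, vx; }; // hypothetical minimal body

static void integrate(Body& b, float dt) { b.x += b.vx * dt; }

void integrateAll(Body* bodies, std::size_t count, float dt)
{
    for (std::size_t i = 0; i < count; ++i) {
        if (i + 4 < count) {
            // Hint the data 4 elements ahead into cache:
            // args are (address, 0 = read, 3 = high temporal locality).
            __builtin_prefetch(&bodies[i + 4], 0, 3);
        }
        integrate(bodies[i], dt);
    }
}
```

How far ahead to prefetch (4 here) is a tuning knob that depends on the hardware and on how much work is done per element.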

@louis-langholtz (Owner, Author)

With pull request #346 merged now, I'm excited that the compilation firewall and the switch to value semantics it introduced will facilitate more experimentation with this and other optimization ideas. Hooray!!

SIMD/MIMD enhancements automation moved this from In Progress to Done May 5, 2021
Repository owner locked and limited conversation to collaborators May 5, 2021

This issue was moved to a discussion.

You can continue the conversation there. Go to discussion →
