@caospacifico

"AABB::intersects_segment" is optimized with Intel SSE instructions and runs ~2X as fast as the original.

Running:
valgrind --tool=callgrind --toggle-collect=TestAABB::test_aabb* --cache-sim=yes --branch-sim=yes ./bin/godot.x11.64 -test aabb

then:
kcachegrind callgrind.out.*

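Normal binary output shows a cycle estimation of 971474:

[image: KCachegrind AABB No SIMD]
https://cloud.githubusercontent.com/assets/12058304/9157352/4c2e72e8-3eaf-11e5-9e41-9d7ed4dfa429.jpg

SIMD optimized binary shows a cycle estimation of 515142:

[image: KCachegrind AABB SIMD]
https://cloud.githubusercontent.com/assets/12058304/9157365/b08a69fe-3eaf-11e5-913a-e0c420ead69d.png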

NOTE: Only optimized for GCC, Intel SSE, and single precision floating point! It will compile on other platforms but won't be optimized.
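
For readers unfamiliar with this kind of optimization, here is a minimal sketch of a segment-versus-box test with an SSE path and a scalar fallback for other platforms. It is not the code from this PR: the names `Vec4`, `segment_hits_aabb`, and `SKETCH_USE_SSE` are hypothetical, and the separating-axis formulation was picked for the sketch because it needs no division.

```cpp
#include <cassert>
#include <cmath>

#if defined(__SSE__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 1)
#include <xmmintrin.h>
#define SKETCH_USE_SSE 1
#endif

// Hypothetical padded vector: x, y, z plus one unused lane, 16-byte aligned.
struct alignas(16) Vec4 {
    float v[4];
};

#ifdef SKETCH_USE_SSE
// Clear the sign bit of every lane.
static inline __m128 abs_ps(__m128 x) { return _mm_andnot_ps(_mm_set1_ps(-0.0f), x); }
// Lane permutations (x,y,z,w) -> (y,z,x,w) and (x,y,z,w) -> (z,x,y,w).
static inline __m128 yzx(__m128 x) { return _mm_shuffle_ps(x, x, _MM_SHUFFLE(3, 0, 2, 1)); }
static inline __m128 zxy(__m128 x) { return _mm_shuffle_ps(x, x, _MM_SHUFFLE(3, 1, 0, 2)); }
#endif

// Returns true when the segment [from, to] touches the box [bmin, bmax].
bool segment_hits_aabb(const Vec4 &from, const Vec4 &to, const Vec4 &bmin, const Vec4 &bmax) {
    assert(bmin.v[0] <= bmax.v[0] && bmin.v[1] <= bmax.v[1] && bmin.v[2] <= bmax.v[2]);
#ifdef SKETCH_USE_SSE
    const __m128 half = _mm_set1_ps(0.5f);
    const __m128 a = _mm_load_ps(from.v), b = _mm_load_ps(to.v);
    const __m128 lo = _mm_load_ps(bmin.v), hi = _mm_load_ps(bmax.v);
    const __m128 e = _mm_mul_ps(_mm_sub_ps(hi, lo), half);              // box half extents
    const __m128 c = _mm_mul_ps(_mm_add_ps(hi, lo), half);              // box center
    const __m128 m = _mm_sub_ps(_mm_mul_ps(_mm_add_ps(a, b), half), c); // segment midpoint, box-relative
    const __m128 d = _mm_mul_ps(_mm_sub_ps(b, a), half);                // segment half direction
    const __m128 ad = abs_ps(d);
    // Three box axes tested at once: |m| > e + |d| means separated on that axis.
    __m128 sep = _mm_cmpgt_ps(abs_ps(m), _mm_add_ps(e, ad));
    // Three cross-product axes tested at once.
    const __m128 lhs = abs_ps(_mm_sub_ps(_mm_mul_ps(yzx(m), zxy(d)), _mm_mul_ps(zxy(m), yzx(d))));
    const __m128 rhs = _mm_add_ps(_mm_mul_ps(yzx(e), zxy(ad)), _mm_mul_ps(zxy(e), yzx(ad)));
    sep = _mm_or_ps(sep, _mm_cmpgt_ps(lhs, rhs));
    return (_mm_movemask_ps(sep) & 0x7) == 0; // ignore the padding lane
#else
    float e[3], m[3], d[3];
    for (int i = 0; i < 3; ++i) {
        e[i] = (bmax.v[i] - bmin.v[i]) * 0.5f;
        m[i] = (from.v[i] + to.v[i]) * 0.5f - (bmax.v[i] + bmin.v[i]) * 0.5f;
        d[i] = (to.v[i] - from.v[i]) * 0.5f;
    }
    for (int i = 0; i < 3; ++i) {
        const int j = (i + 1) % 3, k = (i + 2) % 3;
        if (std::fabs(m[i]) > e[i] + std::fabs(d[i])) return false;
        if (std::fabs(m[j] * d[k] - m[k] * d[j]) > e[j] * std::fabs(d[k]) + e[k] * std::fabs(d[j])) return false;
    }
    return true;
#endif
}
```

Both paths run the same six axis tests; the SSE path simply evaluates three of them per instruction. Whether that yields a speedup like the one measured above depends on the surrounding code and should be profiled.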

@reduz
Member

reduz commented Aug 9, 2015

this is pretty cool, I was planning to do a whole overhaul of the math
stuff using simd after next version, so will most likely use this PR as a
base

You can view, comment on, or merge this pull request online at:

#2351
Commit Summary

  • Fixed resource binary reading for quad vector3.
  • Optimizing Vector3 and add test.
  • Added vector optimization for AABB::intersects_segment.
  • Will handle optimization for double precision later.

@caospacifico
Author

I was trying to make the Vector3 class SIMD optimized, but that created more problems because aligned SSE loads and stores require memory addresses that are multiples of 16 bytes. So I created a Vector3_simd class that I only use in AABB::intersects_segment, where I can control its alignment.
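
The approach described above, copying into an aligned helper type that lives only inside one function, can be sketched roughly as follows. This is not the PR's Vector3_simd class: `Vector3` here is a stand-in for the engine type, and `Vector3Simd`, `is_aligned16`, and `add_aligned` are hypothetical names used for illustration.

```cpp
#include <cassert>
#include <cstdint>

#if defined(__SSE__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 1)
#include <xmmintrin.h>
#define SKETCH_USE_SSE 1
#endif

// Stand-in for the engine's vector: three floats, only 4-byte aligned.
struct Vector3 {
    float x, y, z;
};

// Hypothetical helper: a padded, 16-byte-aligned copy for local SIMD work.
struct alignas(16) Vector3Simd {
    float v[4];
    explicit Vector3Simd(const Vector3 &p) : v{p.x, p.y, p.z, 0.0f} {}
};

inline bool is_aligned16(const void *p) {
    return reinterpret_cast<std::uintptr_t>(p) % 16 == 0;
}

// Component-wise sum computed on aligned stack temporaries.
Vector3 add_aligned(const Vector3 &a, const Vector3 &b) {
    const Vector3Simd sa(a), sb(b); // the compiler aligns these locals
    assert(is_aligned16(sa.v) && is_aligned16(sb.v));
    alignas(16) float out[4];
#ifdef SKETCH_USE_SSE
    // Aligned loads and store are safe because every buffer is 16-byte aligned.
    _mm_store_ps(out, _mm_add_ps(_mm_load_ps(sa.v), _mm_load_ps(sb.v)));
#else
    for (int i = 0; i < 4; ++i) out[i] = sa.v[i] + sb.v[i];
#endif
    return Vector3{out[0], out[1], out[2]};
}
```

The cost of this design is the copy into and out of the aligned temporaries, so it tends to pay off only when several SIMD operations happen between the copies, as in a longer routine like an intersection test.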

@reduz
Member

reduz commented Aug 9, 2015

It should be ok to add an extra float, I think I wrote all the code in the
engine with the expectation that an extra one would eventually be added.

@caospacifico
Author

Thanks for thinking my contribution is cool! It did speed things up quite a bit! Also, I know about the extra float in Vector3. The problem is more complicated than that, though. Memory returned by "new" and "malloc" isn't guaranteed to be 16-byte aligned. A Vector3 could also sit inside a struct or class, e.g. "struct { int a; Vector3 v; };", which throws the alignment off by 4 bytes. These issues are not impossible to fix, but there are a lot of fixes to be made. Another thing is that future contributors would have to consider alignment whenever they want to use these SIMD objects. I just wanted to commit to the project without introducing a lot of bugs!
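
The struct layout problem mentioned above can be checked directly with `offsetof`. The sketch below uses stand-in types (`Vector3`, `Vector3Aligned`, `Holder`, `HolderAligned`, and `misalignment16` are all hypothetical names) and assumes a platform where `int` is 4 bytes.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Stand-in for the engine's vector: 12 bytes, aligned like a float.
struct Vector3 {
    float x, y, z;
};

// Same data plus one padding float, with 16-byte alignment requested.
struct alignas(16) Vector3Aligned {
    float x, y, z, pad;
};

struct Holder {
    int a;
    Vector3 v; // lands right after the int
};

struct HolderAligned {
    int a;
    Vector3Aligned v; // the compiler inserts 12 bytes of padding before this
};

// How far an address is from the previous 16-byte boundary.
inline std::size_t misalignment16(const void *p) {
    return reinterpret_cast<std::uintptr_t>(p) % 16;
}

// Assumption: 4-byte int. The plain vector then sits at offset 4.
static_assert(sizeof(int) != 4 || offsetof(Holder, v) == 4, "plain Vector3 follows the int");
// alignas pushes the member to the next 16-byte boundary and grows the struct.
static_assert(offsetof(HolderAligned, v) == 16, "aligned member starts at 16");
static_assert(sizeof(HolderAligned) == 32, "padding doubles the footprint");
```

So adding a fourth float alone does not make the member aligned; the type needs an alignment requirement as well, and every allocation path must honor it. As far as I know, plain `new` only handles over-aligned types automatically from C++17 onward, which was not available when this thread was written.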

@caospacifico
Author

I may have a solution to the alignment issue. The SSE "movups" instruction
can be used to load unaligned data into "xmm" registers, which can then be
used for each math operation. This way the SSE instructions can be used
without alignment issues. But I don't know whether the math operations will
still be faster once the extra "movups" instructions are added.
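
In intrinsics form, the unaligned-load idea looks roughly like the sketch below. `_mm_loadu_ps` and `_mm_storeu_ps` are the standard unaligned load and store intrinsics, which compilers typically lower to "movups"; the function name `add4_unaligned` and the `SKETCH_USE_SSE` macro are hypothetical.

```cpp
#include <cassert>

#if defined(__SSE__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 1)
#include <xmmintrin.h>
#define SKETCH_USE_SSE 1
#endif

// Adds four floats from a and b into out.
// None of the three pointers needs to be 16-byte aligned.
void add4_unaligned(const float *a, const float *b, float *out) {
#ifdef SKETCH_USE_SSE
    const __m128 va = _mm_loadu_ps(a); // unaligned load
    const __m128 vb = _mm_loadu_ps(b);
    _mm_storeu_ps(out, _mm_add_ps(va, vb)); // unaligned store
#else
    for (int i = 0; i < 4; ++i) out[i] = a[i] + b[i];
#endif
}
```

Note that each load reads four floats, so a three-float Vector3 would still need the extra padding float (or a scalar path for the last component) to avoid reading past the end of the object. Whether this beats the scalar code is an open question that only profiling can answer.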
