Support broader range of CPUs like Intel Core 2 series by using SSE4.1 #78

Merged
merged 3 commits into jrouwe:master on Mar 5, 2022


mrezai
Contributor

@mrezai mrezai commented Mar 4, 2022

Because popcnt is the only SSE4.2 instruction used in the project, I added a software implementation for it and added CMake options for more flexible builds.
All options are ON by default, so the project builds as before. If all options are set to OFF, the project will be compatible with SSE4.1.
I only tested it with Clang. I couldn't test it with Visual Studio, so I didn't implement the CMake options for it.

Resources related to popcount implementation:
https://stackoverflow.com/questions/109023/how-to-count-the-number-of-set-bits-in-a-32-bit-integer
https://graphics.stanford.edu/~seander/bithacks.html#CountBitsSetParallel
https://godbolt.org/z/qGdh1dvKK
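
For readers who don't want to follow the links, the parallel bit count described on the Stanford bit hacks page can be sketched roughly as below. This is only an illustration and not necessarily the code in this PR; the names `ExamplePopcount32` and `ExamplePopcountSelfCheck` are made up for the example.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper (name not taken from the PR): counts the set bits of a
// 32-bit value without using the popcnt instruction.
inline uint32_t ExamplePopcount32(uint32_t v)
{
    v = v - ((v >> 1) & 0x55555555u);                 // each 2-bit field holds the count of its 2 bits
    v = (v & 0x33333333u) + ((v >> 2) & 0x33333333u); // each 4-bit field holds the count of its 4 bits
    v = (v + (v >> 4)) & 0x0F0F0F0Fu;                 // each byte holds the count of its 8 bits
    return (v * 0x01010101u) >> 24;                   // sum the four bytes into the top byte
}

// Small usage check with hand-verified values.
inline void ExamplePopcountSelfCheck()
{
    assert(ExamplePopcount32(0u) == 0u);
    assert(ExamplePopcount32(0x80000001u) == 2u);
    assert(ExamplePopcount32(0xFFFFFFFFu) == 32u);
}
```

The routine uses only shifts, masks, adds and one multiply, so it should run on any CPU the rest of the SSE4.1 build can target.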

Test results:
SSE4.1:

Running scene: ConvexVsMesh
Motion Quality, Thread Count, Steps / Second, Hash
Discrete, 1, 243.629, 5716752184027160728
Discrete, 2, 422.97, 5716752184027160728
Discrete, 3, 582.053, 5716752184027160728
Discrete, 4, 714.728, 5716752184027160728
Discrete, 5, 791.58, 5716752184027160728
Discrete, 6, 858.247, 5716752184027160728
Discrete, 7, 920.468, 5716752184027160728
Discrete, 8, 950.534, 5716752184027160728
LinearCast, 1, 247.095, 4650164293306922608
LinearCast, 2, 427.573, 4650164293306922608
LinearCast, 3, 577.239, 4650164293306922608
LinearCast, 4, 665.414, 4650164293306922608
LinearCast, 5, 793.486, 4650164293306922608
LinearCast, 6, 858.8, 4650164293306922608
LinearCast, 7, 922.867, 4650164293306922608
LinearCast, 8, 967.974, 4650164293306922608

SSE4.2(popcnt):

Running scene: ConvexVsMesh
Motion Quality, Thread Count, Steps / Second, Hash
Discrete, 1, 245.164, 5716752184027160728
Discrete, 2, 432.407, 5716752184027160728
Discrete, 3, 588.171, 5716752184027160728
Discrete, 4, 669.855, 5716752184027160728
Discrete, 5, 816.022, 5716752184027160728
Discrete, 6, 888.699, 5716752184027160728
Discrete, 7, 925.805, 5716752184027160728
Discrete, 8, 990.85, 5716752184027160728
LinearCast, 1, 250.708, 4650164293306922608
LinearCast, 2, 432.997, 4650164293306922608
LinearCast, 3, 591.494, 4650164293306922608
LinearCast, 4, 666.552, 4650164293306922608
LinearCast, 5, 810.698, 4650164293306922608
LinearCast, 6, 874.057, 4650164293306922608
LinearCast, 7, 944.47, 4650164293306922608
LinearCast, 8, 998.861, 4650164293306922608

All options ON:

Running scene: ConvexVsMesh
Motion Quality, Thread Count, Steps / Second, Hash
Discrete, 1, 259.555, 17307524469906541054
Discrete, 2, 451.237, 17307524469906541054
Discrete, 3, 602.937, 17307524469906541054
Discrete, 4, 692.069, 17307524469906541054
Discrete, 5, 791.086, 17307524469906541054
Discrete, 6, 922.998, 17307524469906541054
Discrete, 7, 970.262, 17307524469906541054
Discrete, 8, 1039.1, 17307524469906541054
LinearCast, 1, 259.9, 14934805612666754675
LinearCast, 2, 437.229, 14934805612666754675
LinearCast, 3, 609.016, 14934805612666754675
LinearCast, 4, 710.129, 14934805612666754675
LinearCast, 5, 838.243, 14934805612666754675
LinearCast, 6, 900.946, 14934805612666754675
LinearCast, 7, 982.597, 14934805612666754675
LinearCast, 8, 1025.07, 14934805612666754675

@jrouwe
Owner

jrouwe commented Mar 4, 2022

Hello, thanks for the change! I just tested it on MSVC and it turns off popcount because MSVC doesn't define __SSE4_2__. I'll take a look at how this can be fixed later.

@mrezai
Contributor Author

mrezai commented Mar 4, 2022

I think we can use add_compile_definitions like this to solve the problem:

cmake_minimum_required(VERSION 3.15 FATAL_ERROR)

project(JoltPhysics CXX)

option(USE_SSE4_2 "Enable SSE4.2" ON)
option(USE_AVX2 "Enable AVX2" ON)
option(USE_LZCNT "Enable LZCNT" ON)
option(USE_TZCNT "Enable TZCNT" ON)
option(USE_F16C "Enable F16C" ON)
option(USE_FMADD "Enable FMADD" ON)

if (USE_SSE4_2)
	add_compile_definitions(JPH_USE_SSE4_2)
endif()
if (USE_AVX2)
	add_compile_definitions(JPH_USE_AVX2)
endif()
if (USE_LZCNT)
	add_compile_definitions(JPH_USE_LZCNT)
endif()
if (USE_TZCNT)
	add_compile_definitions(JPH_USE_TZCNT)
endif()
if (USE_F16C)
	add_compile_definitions(JPH_USE_F16C)
endif()
if (USE_FMADD)
	add_compile_definitions(JPH_USE_FMADD)
endif()

@jrouwe
Owner

jrouwe commented Mar 4, 2022

I fixed the MSVC build, but while playing around with the various compiler options I found out that MSVC2022 only supports compiling for AVX and AVX2 (/arch:SSE4 doesn't even exist as a commandline option, /arch:SSE2 gives a warning that the option is unknown). So I think the changes in Core.h and Math.h are not needed as you will never be able to trigger them. On Clang everything goes through '__builtin_popcount' and only on MSVC there's a difference. Shall I remove them (or did you actually intend your version of popcount to be executed on Linux too)?
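
To make the dispatch under discussion concrete, here is a rough sketch of how the choice between `__builtin_popcount`, the MSVC intrinsic and a software fallback could look. It is not the actual contents of Core.h or Math.h: the macro `EXAMPLE_USE_SSE4_2` and the `Example...` function names are assumptions standing in for whatever the build really defines, and the fallback uses the simple clear-lowest-bit loop rather than the PR's implementation.

```cpp
#include <cassert>
#include <cstdint>

#if defined(_MSC_VER) && !defined(__clang__) && defined(EXAMPLE_USE_SSE4_2)
    #include <nmmintrin.h> // declares _mm_popcnt_u32
#endif

// Portable fallback: clears the lowest set bit once per loop iteration.
inline uint32_t ExampleCountBitsSoftware(uint32_t v)
{
    uint32_t count = 0;
    while (v != 0)
    {
        v &= v - 1;
        ++count;
    }
    return count;
}

// Hypothetical dispatcher: picks an implementation at compile time.
inline uint32_t ExampleCountBits(uint32_t v)
{
#if defined(__clang__) || defined(__GNUC__)
    // The compiler emits popcnt or its own fallback depending on the target flags.
    return static_cast<uint32_t>(__builtin_popcount(v));
#elif defined(_MSC_VER) && defined(EXAMPLE_USE_SSE4_2)
    // Always emits the popcnt instruction, so the CPU must support it.
    return static_cast<uint32_t>(_mm_popcnt_u32(v));
#else
    return ExampleCountBitsSoftware(v);
#endif
}

// Usage check: both paths must agree.
inline void ExampleCountBitsSelfCheck()
{
    const uint32_t values[] = { 0u, 1u, 0xFFu, 0x12345678u, 0xFFFFFFFFu };
    for (uint32_t v : values)
        assert(ExampleCountBits(v) == ExampleCountBitsSoftware(v));
}
```

In this sketch the software path is only reachable on MSVC when the macro is left undefined, which is the situation the question above is about.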

@mrezai
Contributor Author

mrezai commented Mar 4, 2022

Do you mean that without AVX or AVX2 the library can't be compiled with MSVC? If that's the case, then your change and suggestion are OK. If not, what about my previous suggestion to use "add_compile_definitions" to enable/disable features explicitly?
I mean that, to support SSE4.1 on MSVC, we add another option for AVX. If all options are disabled as below, the software implementation of popcount will be enabled for MSVC:

option(USE_SSE4_2 "Enable SSE4.2" OFF)
option(USE_AVX "Enable AVX" OFF)
option(USE_AVX2 "Enable AVX2" OFF)

...

if (USE_SSE4_2)
	add_compile_definitions(JPH_USE_SSE4_2)
endif()
if (USE_AVX)
	add_compile_definitions(JPH_USE_AVX)
endif()
if (USE_AVX2)
	add_compile_definitions(JPH_USE_AVX2)
endif()

...

	if ("${CMAKE_CXX_COMPILER_ID}" STREQUAL "MSVC")
		set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} /MP /fp:fast") # Clang doesn't use fast math because it cannot be turned off inside a single compilation unit
		# Pass only the highest requested /arch level
		if (USE_AVX2)
			set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} /arch:AVX2")
		elseif (USE_AVX)
			set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} /arch:AVX")
		endif()

- Added support for SSE4.1 in MSVC
@jrouwe
Owner

jrouwe commented Mar 4, 2022

You're right, when no /arch:XXX command-line option is supplied MSVC will fall back to SSE2 (the property sheet in MSVC allows you to set /arch:SSE2 but that option gives a warning). Unlike in Clang, you can still use the SSE4 intrinsics in that case, so I think I can indeed support an SSE4.1, SSE4.2, AVX and AVX2 configuration on MSVC and then your code additions make sense.

All processors that support AVX2 also support BMI, POPCNT, LZCNT, F16C and FMA, so I'm now enabling them all together if you enable AVX2. AVX always comes with POPCNT, so they go together too. Other than that, I kept your individual switches in case someone wants to target a processor that doesn't support AVX2 but does support some of the other instructions (e.g. some consoles).
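
The implication rules described above can be modelled as a small function, shown here as a sketch only. The real switches live in the CMake file; the struct `ExampleFeatures`, the function `ExampleResolveFeatures` and the choice to tie POPCNT to the `sse4_2` flag (as the option list earlier in this thread does) are assumptions made for the illustration.

```cpp
#include <cassert>

// Hypothetical mirror of the build options discussed in this thread.
struct ExampleFeatures
{
    bool sse4_2 = false; // also stands for POPCNT in this sketch
    bool avx    = false;
    bool avx2   = false;
    bool lzcnt  = false;
    bool tzcnt  = false; // part of BMI
    bool f16c   = false;
    bool fmadd  = false;
};

// Applies the rules: AVX2 switches on everything else, AVX switches on POPCNT.
inline ExampleFeatures ExampleResolveFeatures(ExampleFeatures f)
{
    if (f.avx2)
    {
        f.avx   = true;
        f.lzcnt = true;
        f.tzcnt = true;
        f.f16c  = true;
        f.fmadd = true;
    }
    if (f.avx)
        f.sse4_2 = true;
    return f;
}

// Usage check: requesting only AVX2 enables all other flags.
inline void ExampleResolveSelfCheck()
{
    ExampleFeatures requested;
    requested.avx2 = true;
    const ExampleFeatures resolved = ExampleResolveFeatures(requested);
    assert(resolved.avx && resolved.sse4_2 && resolved.lzcnt);
    assert(resolved.tzcnt && resolved.f16c && resolved.fmadd);
}
```

A flag such as `f16c` requested on its own stays on without pulling in AVX2, which corresponds to the console case mentioned above.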

@jrouwe
Owner

jrouwe commented Mar 4, 2022

Let me know if it works for you like this and I'll merge the changes.

@mrezai
Contributor Author

mrezai commented Mar 5, 2022

Excellent, it looks good to me. Please merge it; we can make additional changes later if something doesn't work correctly.

@jrouwe jrouwe merged commit 80c0b60 into jrouwe:master Mar 5, 2022