Support broader range of CPUs like Intel Core 2 series by using SSE4.1 #78

mrezai · 2022-03-04T11:42:36Z

Because popcnt is the only SSE4.2 instruction used in the project, I added a software implementation for it and added CMake options for more flexible builds.
All options are ON by default, so the project builds as before. If all options are set to OFF, the project is compatible with SSE4.1.
I only tested it with Clang. I couldn't test it with Visual Studio, so I didn't implement the CMake options for it.

Resources related to popcount implementation:
https://stackoverflow.com/questions/109023/how-to-count-the-number-of-set-bits-in-a-32-bit-integer
https://graphics.stanford.edu/~seander/bithacks.html#CountBitsSetParallel
https://godbolt.org/z/qGdh1dvKK
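
For readers who don't want to follow the links, the parallel bit-counting technique they describe can be sketched roughly as below. This is a minimal illustration of the approach from the linked resources, not necessarily the exact code in this PR; the function name CountBitsSoftware is hypothetical.

```cpp
#include <cstdint>

// Hypothetical name: counts the set bits in a 32-bit value without using
// the POPCNT instruction (the "parallel" method from the bithacks page).
inline uint32_t CountBitsSoftware(uint32_t v)
{
    // Sum adjacent bits so each 2-bit field holds its own bit count.
    v = v - ((v >> 1) & 0x55555555u);
    // Sum adjacent 2-bit fields into 4-bit fields.
    v = (v & 0x33333333u) + ((v >> 2) & 0x33333333u);
    // Sum adjacent 4-bit fields into 8-bit fields.
    v = (v + (v >> 4)) & 0x0F0F0F0Fu;
    // Add the four bytes together; the total ends up in the top byte.
    return (v * 0x01010101u) >> 24;
}
```

The whole count is done with a handful of shifts, masks, and one multiply, so it has no branches and no lookup table.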

Test results:
SSE4.1:

Running scene: ConvexVsMesh
Motion Quality, Thread Count, Steps / Second, Hash
Discrete, 1, 243.629, 5716752184027160728
Discrete, 2, 422.97, 5716752184027160728
Discrete, 3, 582.053, 5716752184027160728
Discrete, 4, 714.728, 5716752184027160728
Discrete, 5, 791.58, 5716752184027160728
Discrete, 6, 858.247, 5716752184027160728
Discrete, 7, 920.468, 5716752184027160728
Discrete, 8, 950.534, 5716752184027160728
LinearCast, 1, 247.095, 4650164293306922608
LinearCast, 2, 427.573, 4650164293306922608
LinearCast, 3, 577.239, 4650164293306922608
LinearCast, 4, 665.414, 4650164293306922608
LinearCast, 5, 793.486, 4650164293306922608
LinearCast, 6, 858.8, 4650164293306922608
LinearCast, 7, 922.867, 4650164293306922608
LinearCast, 8, 967.974, 4650164293306922608

SSE4.2(popcnt):

Running scene: ConvexVsMesh
Motion Quality, Thread Count, Steps / Second, Hash
Discrete, 1, 245.164, 5716752184027160728
Discrete, 2, 432.407, 5716752184027160728
Discrete, 3, 588.171, 5716752184027160728
Discrete, 4, 669.855, 5716752184027160728
Discrete, 5, 816.022, 5716752184027160728
Discrete, 6, 888.699, 5716752184027160728
Discrete, 7, 925.805, 5716752184027160728
Discrete, 8, 990.85, 5716752184027160728
LinearCast, 1, 250.708, 4650164293306922608
LinearCast, 2, 432.997, 4650164293306922608
LinearCast, 3, 591.494, 4650164293306922608
LinearCast, 4, 666.552, 4650164293306922608
LinearCast, 5, 810.698, 4650164293306922608
LinearCast, 6, 874.057, 4650164293306922608
LinearCast, 7, 944.47, 4650164293306922608
LinearCast, 8, 998.861, 4650164293306922608

All options ON:

Running scene: ConvexVsMesh
Motion Quality, Thread Count, Steps / Second, Hash
Discrete, 1, 259.555, 17307524469906541054
Discrete, 2, 451.237, 17307524469906541054
Discrete, 3, 602.937, 17307524469906541054
Discrete, 4, 692.069, 17307524469906541054
Discrete, 5, 791.086, 17307524469906541054
Discrete, 6, 922.998, 17307524469906541054
Discrete, 7, 970.262, 17307524469906541054
Discrete, 8, 1039.1, 17307524469906541054
LinearCast, 1, 259.9, 14934805612666754675
LinearCast, 2, 437.229, 14934805612666754675
LinearCast, 3, 609.016, 14934805612666754675
LinearCast, 4, 710.129, 14934805612666754675
LinearCast, 5, 838.243, 14934805612666754675
LinearCast, 6, 900.946, 14934805612666754675
LinearCast, 7, 982.597, 14934805612666754675
LinearCast, 8, 1025.07, 14934805612666754675

Add compile options to cmake

jrouwe · 2022-03-04T16:07:59Z

Hello, thanks for the change! I just tested it on MSVC and it turns off popcount because MSVC doesn't define __SSE4_2__. I'll take a look at how this can be fixed later.

mrezai · 2022-03-04T16:47:53Z

I think we can use add_compile_definitions like this to solve the problem:

cmake_minimum_required(VERSION 3.15 FATAL_ERROR)

project(JoltPhysics CXX)

option(USE_SSE4_2 "Enable SSE4.2" ON)
option(USE_AVX2 "Enable AVX2" ON)
option(USE_LZCNT "Enable LZCNT" ON)
option(USE_TZCNT "Enable TZCNT" ON)
option(USE_F16C "Enable F16C" ON)
option(USE_FMADD "Enable FMADD" ON)

if (USE_SSE4_2)
	add_compile_definitions(JPH_USE_SSE4_2)
endif()
if (USE_AVX2)
	add_compile_definitions(JPH_USE_AVX2)
endif()
if (USE_LZCNT)
	add_compile_definitions(JPH_USE_LZCNT)
endif()
if (USE_TZCNT)
	add_compile_definitions(JPH_USE_TZCNT)
endif()
if (USE_F16C)
	add_compile_definitions(JPH_USE_F16C)
endif()
if (USE_FMADD)
	add_compile_definitions(JPH_USE_FMADD)
endif()
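
On the C++ side, a definition such as JPH_USE_SSE4_2 could then select the popcount path explicitly instead of relying on __SSE4_2__, which MSVC does not define. The following is only a sketch under that assumption; the helper name CountBitsDispatch and the exact layout of the preprocessor branches are hypothetical and may differ from the project's code.

```cpp
#include <cstdint>

#if defined(_MSC_VER) && !defined(__clang__) && defined(JPH_USE_SSE4_2)
#include <nmmintrin.h> // _mm_popcnt_u32
#endif

// Hypothetical helper: returns the number of set bits in v.
inline uint32_t CountBitsDispatch(uint32_t v)
{
#if defined(__clang__) || defined(__GNUC__)
    // Clang/GCC: the builtin compiles to POPCNT when the target allows it
    // and to a software sequence otherwise.
    return static_cast<uint32_t>(__builtin_popcount(v));
#elif defined(_MSC_VER) && defined(JPH_USE_SSE4_2)
    // MSVC with the explicit definition: use the hardware instruction.
    return static_cast<uint32_t>(_mm_popcnt_u32(v));
#else
    // Fallback without POPCNT: parallel bit count.
    v = v - ((v >> 1) & 0x55555555u);
    v = (v & 0x33333333u) + ((v >> 2) & 0x33333333u);
    v = (v + (v >> 4)) & 0x0F0F0F0Fu;
    return (v * 0x01010101u) >> 24;
#endif
}
```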

jrouwe · 2022-03-04T16:50:34Z

I fixed the MSVC build, but while playing around with the various compiler options I found out that MSVC2022 only supports compiling for AVX and AVX2 (/arch:SSE4 doesn't even exist as a command-line option, and /arch:SSE2 gives a warning that the option is unknown). So I think the changes in Core.h and Math.h are not needed, as you will never be able to trigger them. On Clang everything goes through '__builtin_popcount'; only on MSVC is there a difference. Shall I remove them (or did you actually intend your version of popcount to be executed on Linux too)?

mrezai · 2022-03-04T17:59:11Z

Do you mean that the library can't be compiled with MSVC without either AVX or AVX2? If that's the case, then your change and suggestion are OK. If not, what about my previous message about using "add_compile_definitions" to enable/disable features explicitly?
What I mean is: to support SSE4.1 on MSVC, add another option for AVX. If all options are disabled as below, the software implementation of popcount will be enabled for MSVC:

option(USE_SSE4_2 "Enable SSE4.2" OFF)
option(USE_AVX "Enable AVX" OFF)
option(USE_AVX2 "Enable AVX2" OFF)

...

if (USE_SSE4_2)
	add_compile_definitions(JPH_USE_SSE4_2)
endif()
if (USE_AVX)
	add_compile_definitions(JPH_USE_AVX)
endif()
if (USE_AVX2)
	add_compile_definitions(JPH_USE_AVX2)
endif()

...

	if ("${CMAKE_CXX_COMPILER_ID}" STREQUAL "MSVC")
		set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} /MP /fp:fast") # Clang doesn't use fast math because it cannot be turned off inside a single compilation unit
		if (USE_AVX)
			set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} /arch:AVX")
		endif()
		if (USE_AVX2)
			set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} /arch:AVX2")
		endif()

- Added support for SSE4.1 in MSVC

jrouwe · 2022-03-04T21:15:28Z

You're right, when no /arch:XXX command-line option is supplied MSVC will fall back to SSE2 (the property sheet in MSVC allows you to set /arch:SSE2, but that option gives a warning). Unlike in Clang, you can still use the SSE4 intrinsics in that case, so I think I can indeed support an SSE4.1, SSE4.2, AVX and AVX2 configuration on MSVC, and then your code additions make sense.

All processors that support AVX2 also support BMI, POPCNT, LZCNT, F16C and FMA, so I'm now enabling them all together if you enable AVX2. AVX always comes with POPCNT, so they go together too. Other than that, I kept your individual switches in case someone wants to target a processor that doesn't support AVX2 but does support some of the other instructions (e.g. some consoles).
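
As a rough illustration of these implication rules (not the merged CMake logic), they can be written down as a small C++ function. The struct InstructionSets and the function ResolveImplied are hypothetical names. The sketch assumes the flags mirror the USE_* options from this thread, that POPCNT is tracked under the SSE4.2 flag, and that TZCNT stands in for BMI.

```cpp
// Hypothetical struct mirroring the USE_* options discussed in this thread.
struct InstructionSets
{
    bool sse4_2 = false; // also covers POPCNT in this sketch
    bool avx = false;
    bool avx2 = false;
    bool lzcnt = false;
    bool tzcnt = false;  // stands in for BMI in this sketch
    bool f16c = false;
    bool fmadd = false;
};

// Turns on every feature implied by the ones that were requested.
constexpr InstructionSets ResolveImplied(InstructionSets s)
{
    if (s.avx2)
    {
        // AVX2-capable processors also provide these instructions.
        s.avx = true;
        s.lzcnt = true;
        s.tzcnt = true;
        s.f16c = true;
        s.fmadd = true;
    }
    if (s.avx)
        s.sse4_2 = true; // AVX-capable processors always provide POPCNT
    return s;
}
```

Individual switches still work on their own: requesting only F16C, for example, leaves AVX and AVX2 off.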

jrouwe · 2022-03-04T21:17:49Z

Let me know if it works for you like this and I'll merge the changes.

mrezai · 2022-03-05T03:16:09Z

Excellent, it looks good to me. Please merge it; we can make additional changes later if something doesn't work correctly.

Support broader range of CPUs like Intel Core 2 series by using SSE4.1

47d7416

Add compile options to cmake

Ability to switch between AVX / AVX2 using MSVC

282b652

- Added support for AVX target

14a0083

- Added support for SSE4.1 in MSVC

jrouwe merged commit 80c0b60 into jrouwe:master Mar 5, 2022
