Add sse4 if bmi2 is enabled #2300

vondele · 2019-09-13T05:06:31Z

The only change to the code base needed to get a somewhat faster binary, as discussed in #2291, is to add -msse4 to the compile options of the bmi2 build. Since all processors supporting bmi2 also support sse4, this can be done easily. It is a useful step to avoid sending around custom and poorly tested builds.

The speedup isn't enough to pass [0,4]:
LLR: -2.95 (-2.94,2.94) [0.00,4.00]
Total: 93009 W: 20519 L: 20316 D: 52174
But it is roughly 1.15 Elo with a LOS of 90%.

No functional change.
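
For readers who want to sanity-check figures like these from the W/L/D counts, the usual back-of-the-envelope estimates can be sketched as below. This is a minimal sketch and not fishtest's code: the function names are hypothetical, the Elo uses the plain logistic model, and the LOS uses a normal approximation in which draws carry no information. It is not guaranteed to reproduce the figures quoted above, since the estimator behind them is not stated in the thread.

```python
import math


def elo_from_wld(wins, losses, draws):
    """Logistic Elo difference implied by the observed score fraction."""
    games = wins + losses + draws
    score = (wins + 0.5 * draws) / games
    return -400.0 * math.log10(1.0 / score - 1.0)


def los_from_wl(wins, losses):
    """Likelihood of superiority (normal approximation, draws ignored)."""
    z = (wins - losses) / math.sqrt(2.0 * (wins + losses))
    return 0.5 * (1.0 + math.erf(z))


if __name__ == "__main__":
    # Counts taken from the SPRT summary above.
    w, l, d = 20519, 20316, 52174
    print(f"Elo ~ {elo_from_wld(w, l, d):.2f}, LOS ~ {los_from_wl(w, l):.1%}")
```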

snicolet · 2019-09-13T08:55:16Z

I have tested this on my Mac, it compiles OK.

Krgp · 2019-09-13T11:07:02Z

I tried that, and observed that -msse4 doesn't work on AMD CPUs. IIRC the official static binary of SF8 had to be changed following a complaint by Graham on the Talk Chess forum.

crossbr · 2019-09-13T11:20:42Z

@Krgp If you look at vondele's SPRT test, my machine (an AMD Ryzen 7) completed 500 games without any problem (see runs 81 and 257). I know it is a small sample size, but I wonder whether this is still an issue for AMD CPUs. Perhaps we could verify this.

vondele · 2019-09-13T11:31:20Z

If somebody has a link to the talkchess thread, we should be able to figure it out.

MichaelB7 · 2019-09-13T11:50:40Z

SSE4 has been supported by both Intel and AMD processors released since late 2007.

Source:
https://en.m.wikipedia.org/wiki/SSE4

vondele · 2019-09-13T15:20:07Z

I realized the resolution of this is actually easy: just document that bmi2 also enables sse4. I didn't find any CPU that has bmi2 but not sse4; the minority owning one of those will need to read the docs.

I also reordered the architectures for x86-64 so that the more performant builds are on top.

snicolet · 2019-09-14T05:20:30Z

Merged via db00e16, thanks!

ppigazzini · 2019-09-17T15:44:20Z

@vondele on a 48-thread Xeon (not the best CPU to test the speedup) the speedup is very small:

$ bash bench-parallel.sh ./stockfish_msys2_ss3.exe ./stockfish_msys2_ss4.exe 100
base =    1802654 +/- 10464
test =    1804985 +/- 10492
diff =       2330 +/- 3493
speedup = 0.001293
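
The arithmetic behind these numbers can be sketched as follows. This is a minimal illustration, not the code of bench-parallel.sh: the function names are hypothetical, and it assumes the "+/-" value is an error margin around the difference, so a difference smaller than its margin is treated as indistinguishable from noise.

```python
def relative_speedup(base_nps, test_nps):
    """Fractional speedup of test over base, e.g. 0.01 means 1% faster."""
    return (test_nps - base_nps) / base_nps


def is_significant(diff, margin):
    """True if the interval diff +/- margin excludes zero (assumed meaning of '+/-')."""
    return abs(diff) > margin


if __name__ == "__main__":
    # Values copied from the benchmark output above.
    base, test = 1802654, 1804985
    diff, margin = 2330, 3493
    print(f"speedup = {relative_speedup(base, test):.6f}")
    print("significant" if is_significant(diff, margin) else "within noise")
```

With the values above the difference is smaller than its margin, which is consistent with the observation that the speedup on this machine is very small.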

ppigazzini · 2019-09-28T11:05:33Z

@vondele some food for thought on compiler flags:

In the Makefile, mingw lacks the -pedantic and -m64 flags compared with gcc. No speedup, just consistency.
I don't know whether -msse4 supersedes the older flags (-msse3, -msse2); if it does, we should delete -msse.
My Intel Sandy Bridge supports up to sse4.2.
-march=native gives a 'deterministic' speedup (but a very small one: 0.6-0.8%), and it should enable all these flags:

$ gcc -Q -march=native --help=target | grep -v "\[disabled\]"
The following options are target specific:
  -m128bit-long-double                  [enabled]
  -m64                                  [enabled]
  -m80387                               [enabled]
  -mabi=                                sysv
  -maddress-mode=                       long
  -maes                                 [enabled]
  -malign-data=                         compat
  -malign-functions=                    0
  -malign-jumps=                        0
  -malign-loops=                        0
  -malign-stringops                     [enabled]
  -march=                               sandybridge
  -masm=                                att
  -mavx                                 [enabled]
  -mavx256-split-unaligned-load         [enabled]
  -mavx256-split-unaligned-store        [enabled]
  -mbranch-cost=                        3
  -mcmodel=                             [default]
  -mcpu=
  -mcx16                                [enabled]
  -mfancy-math-387                      [enabled]
  -mfp-ret-in-387                       [enabled]
  -mfpmath=                             sse
  -mfunction-return=                    keep
  -mfused-madd
  -mfxsr                                [enabled]
  -mglibc                               [enabled]
  -mhard-float                          [enabled]
  -mieee-fp                             [enabled]
  -mincoming-stack-boundary=            0
  -mindirect-branch=                    keep
  -mintel-syntax
  -mlarge-data-threshold=<number>       65536
  -mlong-double-80                      [enabled]
  -mmemcpy-strategy=
  -mmemset-strategy=
  -mmmx                                 [enabled]
  -mpclmul                              [enabled]
  -mpopcnt                              [enabled]
  -mpreferred-stack-boundary=           0
  -mpush-args                           [enabled]
  -mrecip=
  -mred-zone                            [enabled]
  -mregparm=                            6
  -msahf                                [enabled]
  -msse                                 [enabled]
  -msse2                                [enabled]
  -msse3                                [enabled]
  -msse4                                [enabled]
  -msse4.1                              [enabled]
  -msse4.2                              [enabled]
  -msse5
  -mssse3                               [enabled]
  -mstack-protector-guard=              tls
  -mstringop-strategy=                  [default]
  -mstv                                 [enabled]
  -mtls-dialect=                        gnu
  -mtls-direct-seg-refs                 [enabled]
  -mtune-ctrl=
  -mtune=                               sandybridge
  -mveclibabi=                          [default]
  -mvzeroupper                          [enabled]
  -mxsave                               [enabled]
  -mxsaveopt                            [enabled]

  Known assembler dialects (for use with the -masm= option):
    att intel

  Known ABIs (for use with the -mabi= option):
    ms sysv

  Known code models (for use with the -mcmodel= option):
    32 kernel large medium small

  Valid arguments to -mfpmath=:
    387 387+sse 387,sse both sse sse+387 sse,387

  Known indirect branch choices (for use with the -mindirect-branch=/-mfunction-return= options):
    keep thunk thunk-extern thunk-inline

  Known data alignment choices (for use with the -malign-data= option):
    abi cacheline compat

  Known vectorization library ABIs (for use with the -mveclibabi= option):
    acml svml

  Known address mode (for use with the -maddress-mode= option):
    long short

  Known stack protector guard (for use with the -mstack-protector-guard= option):
    global tls

  Valid arguments to -mstringop-strategy=:
    byte_loop libcall loop rep_4byte rep_8byte rep_byte unrolled_loop vector_loop

  Known TLS dialects (for use with the -mtls-dialect= option):
    gnu gnu2

vondele · 2019-09-28T12:14:49Z

@ppigazzini one option could be to switch to -march=native for the default, and detect available instructions. At least on x86 with gcc this is quite easy.

To generate binaries that can be redistributed, one might still need to target a specific architecture, and the list is long https://gcc.gnu.org/onlinedocs/gcc-9.2.0/gcc/x86-Options.html#x86-Options
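
As a rough illustration of detecting available instructions outside the compiler, the sketch below reads the feature flags that Linux exposes in /proc/cpuinfo and reports which of the features relevant to the bmi2 build are missing. It is an assumption-laden sketch, not part of the Stockfish Makefile: the function names are hypothetical, it is Linux-only, and it relies on the kernel's flag spellings (bmi2, sse4_1, sse4_2, popcnt).

```python
def cpu_flags(cpuinfo_text):
    """Return the set of feature flags from the first 'flags' line of /proc/cpuinfo text."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            return set(value.split())
    return set()


def missing_features(flags, required=("bmi2", "sse4_1", "sse4_2", "popcnt")):
    """List the required features that are absent from the given flag set."""
    return [name for name in required if name not in flags]


if __name__ == "__main__":
    try:
        with open("/proc/cpuinfo") as f:
            flags = cpu_flags(f.read())
    except OSError:
        flags = set()  # not Linux, or the file is unavailable
    print("missing:", missing_features(flags))
```

An empty list means the machine reports every feature the bmi2 build would use; a CPU with bmi2 but without sse4 would show up as a non-empty list that does not contain bmi2.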

ppigazzini · 2019-09-28T13:03:12Z

@vondele -march=native should work on every OS supported by gcc.
The problem is distribution, because a build with -march=native might not work on other CPU types.

IMO adding -msse4 for the 'bmi2 CPU' build raises several doubts:

no speedup in my test (though on a 48-thread Xeon)
according to the gcc link, any 'modern CPU' supports up to -msse4.2
the correct way could be adding all the flags -msse -msse2 -msse3 -msse4
unnecessary Makefile complication

vondele · 2019-09-28T13:24:43Z

Looking at gcc -Q -msse4 --help=target | grep -v "\[disabled\]", adding -msse4 will enable all of:

  -msse                       		[enabled]
  -msse2                      		[enabled]
  -msse3                      		[enabled]
  -msse4                      		[enabled]
  -msse4.1                    		[enabled]
  -msse4.2                    		[enabled]
  -mssse3                     		[enabled]

so at least that's consistent. For speedups, I don't know. It might be pretty specific for the processor version.
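
The same check can be scripted so that flag sets are compared rather than read by eye. The sketch below is only an illustration with hypothetical function names: `enabled_options` parses text in the format shown above, and `query_gcc` assumes a `gcc` executable on the PATH that accepts the `-Q --help=target` invocation used in this thread.

```python
import subprocess


def enabled_options(help_target_text):
    """Collect the -m options that gcc reports as [enabled]."""
    enabled = []
    for line in help_target_text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[0].startswith("-m") and parts[1] == "[enabled]":
            enabled.append(parts[0])
    return enabled


def query_gcc(flag, gcc="gcc"):
    """Run `gcc -Q <flag> --help=target` and return its standard output."""
    result = subprocess.run(
        [gcc, "-Q", flag, "--help=target"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout
```

For example, `set(enabled_options(query_gcc("-msse4")))` could be compared with the set obtained for `-mpopcnt` or `-march=native` to see which flags are implied by which.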

ppigazzini · 2019-09-28T14:10:56Z

@vondele the Makefile could be simplified by enabling only the minimum number of flags (the -mpopcnt flags are a subset of the -msse4 flags). Perhaps it is better to open a Makefile issue, to run some tests and also to solve the problem of the compiler info #2327 (comment)

Add sse4 if bmi2

affba9b

Update docs

299105a

snicolet closed this in db00e16 Sep 14, 2019
