Add sse4 if bmi2 is enabled #2300

Closed

vondele
Member

@vondele vondele commented Sep 13, 2019

The only change to the code base needed to get a somewhat faster binary, as discussed in #2291, is to add -msse4 to the compile options of the bmi2 build. Since all processors supporting bmi2 also support sse4, this can be done easily. It is a useful step to avoid sending around custom and poorly tested builds.

The speedup isn't enough to pass [0,4]:
LLR: -2.95 (-2.94,2.94) [0.00,4.00]
Total: 93009 W: 20519 L: 20316 D: 52174
But it is roughly 1.15 Elo, with a LOS of 90%.

No functional change.

@snicolet
Member

I have tested this on my Mac, it compiles OK.

@Krgp

Krgp commented Sep 13, 2019

I tried that, and -msse4 was observed not to work on AMD CPUs. IIRC the official static binary of SF8 had to be changed after a complaint by Graham on the TalkChess forum.

@crossbr

crossbr commented Sep 13, 2019

@Krgp If you look at Vondele's SPRT test, my machine (an AMD Ryzen 7) completed 500 games without any problem (see runs 81 and 257). I know it is a small sample size, but I wonder whether this is still an issue for AMD CPUs. Perhaps we could verify this.

@vondele
Member Author

vondele commented Sep 13, 2019

If somebody has a link to the TalkChess thread, we should be able to figure it out.

@MichaelB7
Contributor

SSE4 has been supported by both Intel and AMD processors released since late 2007.

Source:
https://en.m.wikipedia.org/wiki/SSE4

@vondele
Member Author

vondele commented Sep 13, 2019

I realized the resolution of this is actually easy: just document that bmi2 also enables sse4. I didn't find any CPU that has bmi2 but not sse4; the minority owning one of those will need to read the docs.

I also reordered the x86-64 architectures so that the more performant build is on top.
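
For anyone who wants to check their own machine, here is a minimal sketch. It assumes Linux, where the kernel lists CPU features on the `flags` line of /proc/cpuinfo under the names `bmi2`, `sse4_1` and `sse4_2`. The helper names `has_flag` and `check_bmi2_sse4` are made up for this illustration and are not part of Stockfish.

```shell
#!/bin/sh
# Sketch only: has_flag and check_bmi2_sse4 are hypothetical helpers.

# has_flag FLAGS NAME: succeed if NAME appears as a whole word in FLAGS.
has_flag() {
    case " $1 " in
        *" $2 "*) return 0 ;;
        *) return 1 ;;
    esac
}

# check_bmi2_sse4 FLAGS: print a one-line verdict for a space-separated flag list.
check_bmi2_sse4() {
    if ! has_flag "$1" bmi2; then
        echo "no bmi2: the bmi2 build does not apply"
    elif has_flag "$1" sse4_1 && has_flag "$1" sse4_2; then
        echo "bmi2 and sse4 present: bmi2 build with -msse4 should run"
    else
        echo "bmi2 without sse4: read the docs before using the bmi2 build"
    fi
}

# Assumption: Linux. Other systems expose CPU features differently.
if [ -r /proc/cpuinfo ]; then
    flags=$(awk -F: '/^flags/ { print $2; exit }' /proc/cpuinfo)
    check_bmi2_sse4 "$flags"
fi
```

The flag list is passed in as a string, so the same function can be run against a `flags` line pasted from somebody else's machine.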

@snicolet snicolet closed this in db00e16 Sep 14, 2019
@snicolet
Member

Merged via db00e16, thanks!

@ppigazzini
Contributor

@vondele on a 48-thread Xeon (not the best CPU for testing the speedup) the speedup is very small:

$ bash bench-parallel.sh ./stockfish_msys2_ss3.exe ./stockfish_msys2_ss4.exe 100
base =    1802654 +/- 10464
test =    1804985 +/- 10492
diff =       2330 +/- 3493
speedup = 0.001293
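
For reference, the speedup line appears to be the relative difference of the two mean speeds, (test - base) / base. This is inferred from the numbers above, not from the bench-parallel.sh source. A small sketch that reproduces it (the function name `speedup` is hypothetical):

```shell
#!/bin/sh
# speedup BASE TEST: print (TEST - BASE) / BASE with six decimals.
# LC_ALL=C keeps the decimal separator a dot regardless of locale.
speedup() {
    LC_ALL=C awk -v base="$1" -v tst="$2" \
        'BEGIN { printf "%.6f\n", (tst - base) / base }'
}

speedup 1802654 1804985   # prints 0.001293
```

Note that the reported error on diff (3493) is larger than diff itself (2330), so this measurement alone cannot distinguish the two builds.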

@ppigazzini
Contributor

@vondele some food for thought on compiler flags:

  • in the Makefile, mingw lacks the -pedantic and -m64 flags compared with gcc. No speedup, just consistency
  • I don't know whether -msse4 supersedes the older ones (-msse3, -msse2); if it does, we should delete -msse
  • my Intel Sandy Bridge supports up to sse4.2
  • -march=native gives a 'deterministic' speedup (but a very small one: 0.6-0.8%) and it should enable all these flags:
$ gcc -Q -march=native --help=target | grep -v "\[disabled\]"
The following options are target specific:
  -m128bit-long-double                  [enabled]
  -m64                                  [enabled]
  -m80387                               [enabled]
  -mabi=                                sysv
  -maddress-mode=                       long
  -maes                                 [enabled]
  -malign-data=                         compat
  -malign-functions=                    0
  -malign-jumps=                        0
  -malign-loops=                        0
  -malign-stringops                     [enabled]
  -march=                               sandybridge
  -masm=                                att
  -mavx                                 [enabled]
  -mavx256-split-unaligned-load         [enabled]
  -mavx256-split-unaligned-store        [enabled]
  -mbranch-cost=                        3
  -mcmodel=                             [default]
  -mcpu=
  -mcx16                                [enabled]
  -mfancy-math-387                      [enabled]
  -mfp-ret-in-387                       [enabled]
  -mfpmath=                             sse
  -mfunction-return=                    keep
  -mfused-madd
  -mfxsr                                [enabled]
  -mglibc                               [enabled]
  -mhard-float                          [enabled]
  -mieee-fp                             [enabled]
  -mincoming-stack-boundary=            0
  -mindirect-branch=                    keep
  -mintel-syntax
  -mlarge-data-threshold=<number>       65536
  -mlong-double-80                      [enabled]
  -mmemcpy-strategy=
  -mmemset-strategy=
  -mmmx                                 [enabled]
  -mpclmul                              [enabled]
  -mpopcnt                              [enabled]
  -mpreferred-stack-boundary=           0
  -mpush-args                           [enabled]
  -mrecip=
  -mred-zone                            [enabled]
  -mregparm=                            6
  -msahf                                [enabled]
  -msse                                 [enabled]
  -msse2                                [enabled]
  -msse3                                [enabled]
  -msse4                                [enabled]
  -msse4.1                              [enabled]
  -msse4.2                              [enabled]
  -msse5
  -mssse3                               [enabled]
  -mstack-protector-guard=              tls
  -mstringop-strategy=                  [default]
  -mstv                                 [enabled]
  -mtls-dialect=                        gnu
  -mtls-direct-seg-refs                 [enabled]
  -mtune-ctrl=
  -mtune=                               sandybridge
  -mveclibabi=                          [default]
  -mvzeroupper                          [enabled]
  -mxsave                               [enabled]
  -mxsaveopt                            [enabled]

  Known assembler dialects (for use with the -masm= option):
    att intel

  Known ABIs (for use with the -mabi= option):
    ms sysv

  Known code models (for use with the -mcmodel= option):
    32 kernel large medium small

  Valid arguments to -mfpmath=:
    387 387+sse 387,sse both sse sse+387 sse,387

  Known indirect branch choices (for use with the -mindirect-branch=/-mfunction-return= options):
    keep thunk thunk-extern thunk-inline

  Known data alignment choices (for use with the -malign-data= option):
    abi cacheline compat

  Known vectorization library ABIs (for use with the -mveclibabi= option):
    acml svml

  Known address mode (for use with the -maddress-mode= option):
    long short

  Known stack protector guard (for use with the -mstack-protector-guard= option):
    global tls

  Valid arguments to -mstringop-strategy=:
    byte_loop libcall loop rep_4byte rep_8byte rep_byte unrolled_loop vector_loop

  Known TLS dialects (for use with the -mtls-dialect= option):
    gnu gnu2

@vondele
Member Author

vondele commented Sep 28, 2019

@ppigazzini one option could be to switch to -march=native for the default, and detect available instructions. At least on x86 with gcc this is quite easy.

To generate binaries that can be redistributed, one might still need to target a specific architecture, and the list is long: https://gcc.gnu.org/onlinedocs/gcc-9.2.0/gcc/x86-Options.html#x86-Options
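
One possible way to detect the available instructions with gcc is to dump the macros the compiler predefines under -march=native and look for the ISA ones. The sketch below is only an illustration, not a Makefile proposal. It relies on the standard `-dM -E` preprocessor options and on the predefined macros `__SSE4_1__`, `__SSE4_2__`, `__POPCNT__` and `__BMI2__`. The function name `isa_macros` is made up.

```shell
#!/bin/sh
# isa_macros: read `gcc -dM -E` output on stdin and print the names of the
# ISA feature macros of interest that are defined, one per line, sorted.
isa_macros() {
    grep -E '^#define __(SSE4_1|SSE4_2|POPCNT|BMI2)__ ' \
        | awk '{ print $2 }' \
        | LC_ALL=C sort
}

# Only query the compiler if one is installed.
if command -v gcc >/dev/null 2>&1; then
    gcc -march=native -dM -E - </dev/null | isa_macros
fi
```

Replacing -march=native with a concrete target (for example one of the names in the x86 options list linked above) shows what a redistributable build for that target would assume.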

@ppigazzini
Contributor

@vondele -march=native should work on every OS supported by gcc.
The problem is distribution, because a build with -march=native might not work on other CPU types.

IMO adding a -msse4 for the 'bmi2 CPU' build raises several doubts:

  • no speedup in my test (but with a 48-thread Xeon)
  • according to the gcc link, any 'modern CPU' supports up to -msse4.2
  • the correct way could be adding all the flags -msse -msse2 -msse3 -msse4
  • unnecessary Makefile complication

@vondele
Member Author

vondele commented Sep 28, 2019

looking at gcc -Q -msse4 --help=target | grep -v "\[disabled\]", adding -msse4 will enable all of

  -msse                       		[enabled]
  -msse2                      		[enabled]
  -msse3                      		[enabled]
  -msse4                      		[enabled]
  -msse4.1                    		[enabled]
  -msse4.2                    		[enabled]
  -mssse3                     		[enabled]

so at least that's consistent. As for speedups, I don't know. They might be pretty specific to the processor version.

@ppigazzini
Contributor

@vondele The Makefile could be simplified by enabling only the minimum number of flags (the -mpopcnt flags are a subset of the -msse4 flags). Perhaps it is better to open a Makefile issue, run some tests, and also solve the problem of the compiler info in #2327 (comment)
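
A quick way to test the subset claim on a given gcc is to compare the options each flag marks as [enabled] in the `gcc -Q ... --help=target` output. This is a rough sketch; the helper names `enabled_flags` and `missing_from` are invented for the example, and the result may differ between gcc versions and targets.

```shell
#!/bin/sh
# enabled_flags: read `gcc -Q ... --help=target` output on stdin and print
# the options marked [enabled], one per line, sorted for use with comm.
enabled_flags() {
    awk '$NF == "[enabled]" { print $1 }' | LC_ALL=C sort
}

# missing_from A B: print the entries of sorted file A that are not in sorted file B.
missing_from() {
    LC_ALL=C comm -23 "$1" "$2"
}

if command -v gcc >/dev/null 2>&1; then
    tmp=$(mktemp -d)
    gcc -Q -mpopcnt --help=target 2>/dev/null | enabled_flags > "$tmp/popcnt"
    gcc -Q -msse4 --help=target 2>/dev/null | enabled_flags > "$tmp/sse4"
    # No output here means everything -mpopcnt enables is also enabled by -msse4.
    missing_from "$tmp/popcnt" "$tmp/sse4"
    rm -rf "$tmp"
fi
```

If the comparison prints nothing, dropping -mpopcnt from a build that already passes -msse4 should not change the enabled instruction sets on that compiler.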
