
Compute optimal register count for feature transformer accumulation dynamically. #3543

Closed
wants to merge 1 commit

Conversation

@Sopel97 (Member) commented Jun 10, 2021

This also fixes a "bug" where AVX512 would only use 8 registers instead of 16 (now possible due to a 2x increase in FT size).

@vondele (Member) commented Jun 10, 2021

architectures like 512x24x16 don't compile on master, or with this patch. Is this something that can be fixed in the same round?

@Sopel97 (Member, Author) commented Jun 10, 2021

No. It would need completely different code for the affine transform.

@snicolet (Member) commented Jun 11, 2021

Does anybody have speed data for this patch? And fishtest data?

@vondele (Member) commented Jun 11, 2021

I believe this makes essentially no difference at run time, since the constants are computed at compile-time, and are the same except for avx512. For the latter architecture, one should measure a speedup, which might be worth verifying with a bench.

I think it is better than hard-coding the constants as we do now.

@snicolet (Member) commented

> For the latter architecture, one should measure a speedup, which might be worth verifying with a bench.

Indeed it might! :-) It is no light matter to add 48 lines to the Stockfish code just for the fun of it, because it feels better, or because there might be a speed gain on some processor that is not mainstream at the moment.

@snicolet (Member) commented

Bottom line: the complexity and size of the NNUE code are exploding. We should try to simplify it, not complexify it for the sake of 0.1% unproven speed-ups or the satisfaction of algorithmic brilliance.

@vondele (Member) commented Jun 11, 2021

Yes, I understand that argument. I think the value of the patch is in removing the 'magic constants' for each processor, which are not so obvious; the added code acts as documentation that explains how to obtain them.

Essentially, there was a mistake in the AVX512 constants precisely because they are non-obvious.

@snicolet (Member) commented Jun 11, 2021

There are as many magic constants in the patch as in master; they are just hidden in the fourth parameter(!) of the templates for calculating the "optimal register count":

static constexpr IndexType NumRegs = BestRegCount<vec_t, std::int16_t, TransformedFeatureDimensions, 32>;
static constexpr IndexType NumRegs = BestRegCount<vec_t, std::int16_t, TransformedFeatureDimensions, 16>;
static constexpr IndexType NumRegs = BestRegCount<vec_t, std::int16_t, TransformedFeatureDimensions, Is64Bit ? 16 : 8>;
static constexpr IndexType NumRegs = BestRegCount<vec_t, std::int16_t, TransformedFeatureDimensions, 8>;
static constexpr IndexType NumRegs = BestRegCount<vec_t, std::int16_t, TransformedFeatureDimensions, 16>;

static constexpr IndexType NumPsqtRegs = BestRegCount<psqt_vec_t, std::int32_t, PSQTBuckets, 32>;
static constexpr IndexType NumPsqtRegs = BestRegCount<psqt_vec_t, std::int32_t, PSQTBuckets, 16>;
static constexpr IndexType NumPsqtRegs = BestRegCount<psqt_vec_t, std::int32_t, PSQTBuckets, Is64Bit ? 16 : 8>;
static constexpr IndexType NumPsqtRegs = BestRegCount<psqt_vec_t, std::int32_t, PSQTBuckets, 8>;
static constexpr IndexType NumPsqtRegs = BestRegCount<psqt_vec_t, std::int32_t, PSQTBuckets, 16>;

Eagerly waiting for people trying to tune these constants with SPSA :-)

@vondele (Member) commented Jun 11, 2021

That fourth parameter is a property of the architecture (the available registers), and while one needs to look it up, it is relatively clear what it is. The code transforms that hardware-specific number into what is needed based on the implementation details (basically the other arguments of the template). I'm not arguing it is pretty; I'm just saying that without this code, I (and most other developers) would probably not be able to guess/compute the right values for NumRegs and NumPsqtRegs.

@noobpwnftw (Contributor) commented

Reference:
https://docs.microsoft.com/en-us/cpp/build/x64-software-conventions?view=msvc-160

So the x86-64 arch has 16 registers (well known), but one can only pass up to 4 parameters via registers without extra cost, which is not obvious. :)

@snicolet (Member) commented

I suppose we are using the number of SIMD registers here, not the number of general purpose registers?

@snicolet (Member) commented

Here is another version of the patch: https://github.com/snicolet/Stockfish/tree/optimal_register_count2

@Sopel97 (Member, Author) commented Jun 11, 2021

The biggest point of this patch is that it allows testing nets like 384x2-32-32-1 without having to change these magic constants. Previously people were deterred by the compiler errors and had to ask on discord what to do.

I briefly tested it on AVX512 and found no measurable gain.

@noobpwnftw (Contributor) commented

The point holds, we should derive those numbers from architecture definitions which are easier to understand.

@gvreuls (Contributor) commented Jun 11, 2021

master:

 Performance counter stats for 'system wide' (10 runs):

    10.428.204.827      cycles:u                                                      ( +-  0,08% )
    18.039.839.102      instructions:u            #    1,73  insn per cycle           ( +-  0,03% )

            2,6391 +- 0,0174 seconds time elapsed  ( +-  0,66% )

patch:

 Performance counter stats for 'system wide' (10 runs):

    10.444.438.767      cycles:u                                                      ( +-  0,23% )
    17.928.118.822      instructions:u            #    1,72  insn per cycle           ( +-  0,04% )

            2,6099 +- 0,0100 seconds time elapsed  ( +-  0,38% )

@snicolet (Member) commented Jun 11, 2021

> The point holds, we should derive those numbers from architecture definitions which are easier to understand.

This is what I tried to do in https://github.com/snicolet/Stockfish/tree/optimal_register_count2 , where each SIMD code path only sets the number of SIMD registers in the NumRegistersSIMD macro.

@Sopel97 (Member, Author) commented Jun 11, 2021

So in this PR it would be:

  #define VECTOR

  static_assert(PSQTBuckets % 8 == 0,
    "Per feature PSQT values cannot be processed at granularity lower than 8 at a time.");

  #ifdef USE_AVX512
  typedef __m512i vec_t;
  typedef __m256i psqt_vec_t;
  #define vec_load(a) _mm512_load_si512(a)
  #define vec_store(a,b) _mm512_store_si512(a,b)
  #define vec_add_16(a,b) _mm512_add_epi16(a,b)
  #define vec_sub_16(a,b) _mm512_sub_epi16(a,b)
  #define vec_load_psqt(a) _mm256_load_si256(a)
  #define vec_store_psqt(a,b) _mm256_store_si256(a,b)
  #define vec_add_psqt_32(a,b) _mm256_add_epi32(a,b)
  #define vec_sub_psqt_32(a,b) _mm256_sub_epi32(a,b)
  #define vec_zero_psqt() _mm256_setzero_si256()
  static constexpr inline IndexType NumSimdRegisters = 32;
  static constexpr inline IndexType NumPsqtSimdRegisters = 32;

  #elif USE_AVX2
  typedef __m256i vec_t;
  typedef __m256i psqt_vec_t;
  #define vec_load(a) _mm256_load_si256(a)
  #define vec_store(a,b) _mm256_store_si256(a,b)
  #define vec_add_16(a,b) _mm256_add_epi16(a,b)
  #define vec_sub_16(a,b) _mm256_sub_epi16(a,b)
  #define vec_load_psqt(a) _mm256_load_si256(a)
  #define vec_store_psqt(a,b) _mm256_store_si256(a,b)
  #define vec_add_psqt_32(a,b) _mm256_add_epi32(a,b)
  #define vec_sub_psqt_32(a,b) _mm256_sub_epi32(a,b)
  #define vec_zero_psqt() _mm256_setzero_si256()
  static constexpr inline IndexType NumSimdRegisters = 16;
  static constexpr inline IndexType NumPsqtSimdRegisters = 16;

  #elif USE_SSE2
  typedef __m128i vec_t;
  typedef __m128i psqt_vec_t;
  #define vec_load(a) (*(a))
  #define vec_store(a,b) *(a)=(b)
  #define vec_add_16(a,b) _mm_add_epi16(a,b)
  #define vec_sub_16(a,b) _mm_sub_epi16(a,b)
  #define vec_load_psqt(a) (*(a))
  #define vec_store_psqt(a,b) *(a)=(b)
  #define vec_add_psqt_32(a,b) _mm_add_epi32(a,b)
  #define vec_sub_psqt_32(a,b) _mm_sub_epi32(a,b)
  #define vec_zero_psqt() _mm_setzero_si128()
  static constexpr inline IndexType NumSimdRegisters = Is64Bit ? 16 : 8;
  static constexpr inline IndexType NumPsqtSimdRegisters = Is64Bit ? 16 : 8;

  #elif USE_MMX
  typedef __m64 vec_t;
  typedef __m64 psqt_vec_t;
  #define vec_load(a) (*(a))
  #define vec_store(a,b) *(a)=(b)
  #define vec_add_16(a,b) _mm_add_pi16(a,b)
  #define vec_sub_16(a,b) _mm_sub_pi16(a,b)
  #define vec_load_psqt(a) (*(a))
  #define vec_store_psqt(a,b) *(a)=(b)
  #define vec_add_psqt_32(a,b) _mm_add_pi32(a,b)
  #define vec_sub_psqt_32(a,b) _mm_sub_pi32(a,b)
  #define vec_zero_psqt() _mm_setzero_si64()
  static constexpr inline IndexType NumSimdRegisters = 8;
  static constexpr inline IndexType NumPsqtSimdRegisters = 8;

  #elif USE_NEON
  typedef int16x8_t vec_t;
  typedef int32x4_t psqt_vec_t;
  #define vec_load(a) (*(a))
  #define vec_store(a,b) *(a)=(b)
  #define vec_add_16(a,b) vaddq_s16(a,b)
  #define vec_sub_16(a,b) vsubq_s16(a,b)
  #define vec_load_psqt(a) (*(a))
  #define vec_store_psqt(a,b) *(a)=(b)
  #define vec_add_psqt_32(a,b) vaddq_s32(a,b)
  #define vec_sub_psqt_32(a,b) vsubq_s32(a,b)
  #define vec_zero_psqt() psqt_vec_t{0}
  static constexpr inline IndexType NumSimdRegisters = 16;
  static constexpr inline IndexType NumPsqtSimdRegisters = 16;

  #else
  #undef VECTOR

  #endif

  #if defined VECTOR
  static constexpr IndexType NumRegs = BestRegCount<vec_t, std::int16_t, TransformedFeatureDimensions, NumSimdRegisters>;
  static constexpr IndexType NumPsqtRegs = BestRegCount<psqt_vec_t, std::int32_t, PSQTBuckets, NumPsqtSimdRegisters>;
  #endif

which I'm not sure is better

@snicolet (Member) commented

Just curious: why exactly are vec_t and psqt_vec_t distinct on NEON? They obviously have the same size; could we unify them like in all the other SIMD code paths?

@Sopel97 (Member, Author) commented Jun 11, 2021

Note that AVX512 also uses a different psqt_vec_t. On NEON the types just seem more type-safe: the type indicates the lane type. I don't think it's possible to use an opaque type there.

@snicolet (Member) commented Jun 13, 2021

@Sopel97 Some questions:

a) Am I correct in https://github.com/snicolet/Stockfish/tree/optimal_register_count2 to assume that the last parameter of BestRegisterCount() in nnue_feature_transformer.h is the number of SIMD registers of the processor?

b) I see that nnue_common.h has the SimdWidth variable, which as far as I understand is the size of each SIMD register.

b1) Are the values always correct, even for 64-bit/32-bit SSE2?
b2) Why is SimdWidth sometimes divided by 2 or by 4 later in the code?
b3) Why couldn't we put the NumRegistersSIMD variable in nnue_common.h too?

// SIMD width (in bytes)
  #if defined(USE_AVX2)
  constexpr std::size_t SimdWidth = 32;

  #elif defined(USE_SSE2)
  constexpr std::size_t SimdWidth = 16;

  #elif defined(USE_MMX)
  constexpr std::size_t SimdWidth = 8;

  #elif defined(USE_NEON)
  constexpr std::size_t SimdWidth = 16;
  #endif

[...]

#if defined (USE_AVX512)
    static constexpr const IndexType OutputSimdWidth = SimdWidth / 2;
#elif defined (USE_SSSE3)
    static constexpr const IndexType OutputSimdWidth = SimdWidth / 4;
#endif

@Sopel97 (Member, Author) commented Jun 13, 2021

a) More precisely, it's the number of registers that we're willing to use.
b) Not quite; it's wrong for AVX512.
b2) A weird artifact from the past, when the AVX512 code was really using only AVX2. Now we have to maintain it. Also, weight padding was only to 32 (and it is done in the serialized net, which is a terrible design decision, but whatever).
b3) We could. But honestly, all these SimdWidth constants are asking for a rewrite: what is "width"? Byte width? Lane width? For what lane type? It needs to provide more information and be more explicit about what it provides. It's also important to note that having access to AVX512 doesn't mean we actually use it everywhere, which complicates the semantics.

To elaborate on b3: the interface I'd like would encompass all the SIMD type traits. So a struct SimdTraits<SimdT> with things like SimdTraits<__m128 (or possibly some enum M128 instead, to avoid issues with passing these types and having i and f overloads)>::LaneCount<LaneT>, where LaneT is one of [u]int(8,16,32,64) or float/double, plus SimdTraits<SimdT>::NumRegisters. And have aliases like using LargestSimdType = /* arch dependent */ and SimdTraits<SimdT (the enum)>::Type<LaneT>.
The traits could even be always defined, setting NumRegisters to 0 when unavailable.

I can try to implement my vision in the following few days and we'll see how it looks

@snicolet closed this in ce4c523 Jun 13, 2021
@snicolet added the "to be merged" label (Will be merged shortly) Jun 13, 2021
@snicolet (Member) commented

OK, and good luck! Please keep your vision as simple and first-principles as possible :-)

Meanwhile I've merged the current documentation patch as ce4c523, thanks!

@snicolet (Member) commented Jun 13, 2021

I feel a little bit sorry to have merged a functional change for AVX512 without any speed data, but since nobody answered my question, well... :-)
