
Compute optimal register count for feature transformer accumulation dynamically. #3543

Closed
wants to merge 1 commit

Conversation

@Sopel97 (Member) commented Jun 10, 2021

This also fixes a "bug" where AVX512 would only use 8 registers instead of 16 (now possible due to a 2x increase in FT size).

@vondele (Member) commented Jun 10, 2021

architectures like 512x24x16 don't compile on master, or with this patch. Is this something that can be fixed in the same round?

@Sopel97 (Member, Author) commented Jun 10, 2021

No. It would need completely different code for the affine transform.

@snicolet (Member) commented Jun 11, 2021

Does anybody have speed data for this patch? And fishtest data?

@vondele (Member) commented Jun 11, 2021

I believe this makes essentially no difference at run time, since the constants are computed at compile-time, and are the same except for avx512. For the latter architecture, one should measure a speedup, which might be worth verifying with a bench.

I think it is better than hard-coding the constants as we do now.

@snicolet (Member) commented

> For the latter architecture, one should measure a speedup, which might be worth verifying with a bench.

Indeed it might! :-) It is no light matter to add 48 lines to the Stockfish code just for the fun of it, because it feels better, or because there might be a speed gain on some processor that is not mainstream at the moment.

@snicolet (Member) commented

Bottom line: the complexity and size of the NNUE code are exploding. We should try to simplify it, not complexify it for the sake of 0.1% unproven speed-ups or the satisfaction of algorithmic brilliance.

@vondele (Member) commented Jun 11, 2021

Yes, I understand that argument. I think the value of the patch is in removing the 'magic constants' for each processor, which are not so obvious; the added code acts as documentation that explains how to obtain them.

Essentially, there was a mistake in the AVX512 constants precisely because they are non-obvious.

@snicolet (Member) commented Jun 11, 2021

There are as many magic constants in the patch as in master; they are just hidden in the fourth parameter(!) of the templates for calculating the "optimal register count":

static constexpr IndexType NumRegs = BestRegCount<vec_t, std::int16_t, TransformedFeatureDimensions, 32>;
static constexpr IndexType NumRegs = BestRegCount<vec_t, std::int16_t, TransformedFeatureDimensions, 16>;
static constexpr IndexType NumRegs = BestRegCount<vec_t, std::int16_t, TransformedFeatureDimensions, Is64Bit ? 16 : 8>;
static constexpr IndexType NumRegs = BestRegCount<vec_t, std::int16_t, TransformedFeatureDimensions, 8>;
static constexpr IndexType NumRegs = BestRegCount<vec_t, std::int16_t, TransformedFeatureDimensions, 16>;

static constexpr IndexType NumPsqtRegs = BestRegCount<psqt_vec_t, std::int32_t, PSQTBuckets, 32>;
static constexpr IndexType NumPsqtRegs = BestRegCount<psqt_vec_t, std::int32_t, PSQTBuckets, 16>;
static constexpr IndexType NumPsqtRegs = BestRegCount<psqt_vec_t, std::int32_t, PSQTBuckets, Is64Bit ? 16 : 8>;
static constexpr IndexType NumPsqtRegs = BestRegCount<psqt_vec_t, std::int32_t, PSQTBuckets, 8>;
static constexpr IndexType NumPsqtRegs = BestRegCount<psqt_vec_t, std::int32_t, PSQTBuckets, 16>;

Eagerly waiting for people trying to tune these constants with SPSA :-)

@vondele (Member) commented Jun 11, 2021

That fourth parameter is a property of the architecture (the available registers), and while one needs to look it up, it is relatively clear what it is. The code transforms that hardware-specific number into what is needed based on the implementation details (basically the other arguments of the template). I'm not arguing it is pretty; I'm just saying that without this code, I (and most other developers) would probably not be able to guess/compute the right values for NumRegs and NumPsqtRegs.

@noobpwnftw (Contributor) commented

Reference:
https://docs.microsoft.com/en-us/cpp/build/x64-software-conventions?view=msvc-160

So the x86-64 arch has 16 registers (well known), but one can only pass up to 4 parameters via registers without extra cost, which is not obvious. :)

@snicolet (Member) commented

I suppose we are using the number of SIMD registers here, not the number of general purpose registers?

@snicolet (Member) commented

Here is another version of the patch: https://github.com/snicolet/Stockfish/tree/optimal_register_count2

@Sopel97 (Member, Author) commented Jun 11, 2021

The biggest point of this patch is that it allows testing nets like 384x2-32-32-1 without having to change these magic constants. Previously people were deterred by the compiler errors and had to ask on discord what to do.

I briefly tested it on AVX512 and found no measurable gain.

@noobpwnftw (Contributor) commented

The point holds, we should derive those numbers from architecture definitions which are easier to understand.

@gvreuls (Contributor) commented Jun 11, 2021

master:

 Performance counter stats for 'system wide' (10 runs):

    10.428.204.827      cycles:u                                                      ( +-  0,08% )
    18.039.839.102      instructions:u            #    1,73  insn per cycle           ( +-  0,03% )

            2,6391 +- 0,0174 seconds time elapsed  ( +-  0,66% )

patch:

 Performance counter stats for 'system wide' (10 runs):

    10.444.438.767      cycles:u                                                      ( +-  0,23% )
    17.928.118.822      instructions:u            #    1,72  insn per cycle           ( +-  0,04% )

            2,6099 +- 0,0100 seconds time elapsed  ( +-  0,38% )

@snicolet (Member) commented Jun 11, 2021

> The point holds, we should derive those numbers from architecture definitions which are easier to understand.

This is what I tried to do in https://github.com/snicolet/Stockfish/tree/optimal_register_count2 , where each SIMD code path only sets the number of SIMD registers in the NumRegistersSIMD macro.

@Sopel97 (Member, Author) commented Jun 11, 2021

So in this PR it would be:

  #define VECTOR

  static_assert(PSQTBuckets % 8 == 0,
    "Per feature PSQT values cannot be processed at granularity lower than 8 at a time.");

  #ifdef USE_AVX512
  typedef __m512i vec_t;
  typedef __m256i psqt_vec_t;
  #define vec_load(a) _mm512_load_si512(a)
  #define vec_store(a,b) _mm512_store_si512(a,b)
  #define vec_add_16(a,b) _mm512_add_epi16(a,b)
  #define vec_sub_16(a,b) _mm512_sub_epi16(a,b)
  #define vec_load_psqt(a) _mm256_load_si256(a)
  #define vec_store_psqt(a,b) _mm256_store_si256(a,b)
  #define vec_add_psqt_32(a,b) _mm256_add_epi32(a,b)
  #define vec_sub_psqt_32(a,b) _mm256_sub_epi32(a,b)
  #define vec_zero_psqt() _mm256_setzero_si256()
  static constexpr inline IndexType NumSimdRegisters = 32;
  static constexpr inline IndexType NumPsqtSimdRegisters = 32;

  #elif USE_AVX2
  typedef __m256i vec_t;
  typedef __m256i psqt_vec_t;
  #define vec_load(a) _mm256_load_si256(a)
  #define vec_store(a,b) _mm256_store_si256(a,b)
  #define vec_add_16(a,b) _mm256_add_epi16(a,b)
  #define vec_sub_16(a,b) _mm256_sub_epi16(a,b)
  #define vec_load_psqt(a) _mm256_load_si256(a)
  #define vec_store_psqt(a,b) _mm256_store_si256(a,b)
  #define vec_add_psqt_32(a,b) _mm256_add_epi32(a,b)
  #define vec_sub_psqt_32(a,b) _mm256_sub_epi32(a,b)
  #define vec_zero_psqt() _mm256_setzero_si256()
  static constexpr inline IndexType NumSimdRegisters = 16;
  static constexpr inline IndexType NumPsqtSimdRegisters = 16;

  #elif USE_SSE2
  typedef __m128i vec_t;
  typedef __m128i psqt_vec_t;
  #define vec_load(a) (*(a))
  #define vec_store(a,b) *(a)=(b)
  #define vec_add_16(a,b) _mm_add_epi16(a,b)
  #define vec_sub_16(a,b) _mm_sub_epi16(a,b)
  #define vec_load_psqt(a) (*(a))
  #define vec_store_psqt(a,b) *(a)=(b)
  #define vec_add_psqt_32(a,b) _mm_add_epi32(a,b)
  #define vec_sub_psqt_32(a,b) _mm_sub_epi32(a,b)
  #define vec_zero_psqt() _mm_setzero_si128()
  static constexpr inline IndexType NumSimdRegisters = Is64Bit ? 16 : 8;
  static constexpr inline IndexType NumPsqtSimdRegisters = Is64Bit ? 16 : 8;

  #elif USE_MMX
  typedef __m64 vec_t;
  typedef __m64 psqt_vec_t;
  #define vec_load(a) (*(a))
  #define vec_store(a,b) *(a)=(b)
  #define vec_add_16(a,b) _mm_add_pi16(a,b)
  #define vec_sub_16(a,b) _mm_sub_pi16(a,b)
  #define vec_load_psqt(a) (*(a))
  #define vec_store_psqt(a,b) *(a)=(b)
  #define vec_add_psqt_32(a,b) _mm_add_pi32(a,b)
  #define vec_sub_psqt_32(a,b) _mm_sub_pi32(a,b)
  #define vec_zero_psqt() _mm_setzero_si64()
  static constexpr inline IndexType NumSimdRegisters = 8;
  static constexpr inline IndexType NumPsqtSimdRegisters = 8;

  #elif USE_NEON
  typedef int16x8_t vec_t;
  typedef int32x4_t psqt_vec_t;
  #define vec_load(a) (*(a))
  #define vec_store(a,b) *(a)=(b)
  #define vec_add_16(a,b) vaddq_s16(a,b)
  #define vec_sub_16(a,b) vsubq_s16(a,b)
  #define vec_load_psqt(a) (*(a))
  #define vec_store_psqt(a,b) *(a)=(b)
  #define vec_add_psqt_32(a,b) vaddq_s32(a,b)
  #define vec_sub_psqt_32(a,b) vsubq_s32(a,b)
  #define vec_zero_psqt() psqt_vec_t{0}
  static constexpr inline IndexType NumSimdRegisters = 16;
  static constexpr inline IndexType NumPsqtSimdRegisters = 16;

  #else
  #undef VECTOR

  #endif

  #if defined VECTOR
  static constexpr IndexType NumRegs = BestRegCount<vec_t, std::int16_t, TransformedFeatureDimensions, NumSimdRegisters>;
  static constexpr IndexType NumPsqtRegs = BestRegCount<psqt_vec_t, std::int32_t, PSQTBuckets, NumPsqtSimdRegisters>;
  #endif

which I'm not sure is better

@snicolet (Member) commented

Just curious: why exactly are vec_t and psqt_vec_t distinct on NEON? They obviously have the same size; could we unify them like in all the other SIMD code paths?

@Sopel97 (Member, Author) commented Jun 11, 2021

Note that AVX512 also uses a different psqt_vec_t. On NEON the types just seem more type-safe: the type indicates the lane type. I don't think it's possible to use an opaque type there.

@snicolet (Member) commented Jun 13, 2021

@Sopel97 Some questions:

a) Am I correct in https://github.com/snicolet/Stockfish/tree/optimal_register_count2 to assume that the last parameter of BestRegisterCount() in nnue_feature_transformer.h is the number of SIMD registers of the processor?

b) I see that nnue_common.h has the SimdWidth variable, which as far as I understand is the size of each SIMD register.

b1) Are the values always correct, even for 64-bit/32-bit SSE2?
b2) Why is SimdWidth sometimes divided by 2 or by 4 later in the code?
b3) Why couldn't we put the NumRegistersSIMD variable in nnue_common.h too?

// SIMD width (in bytes)
  #if defined(USE_AVX2)
  constexpr std::size_t SimdWidth = 32;

  #elif defined(USE_SSE2)
  constexpr std::size_t SimdWidth = 16;

  #elif defined(USE_MMX)
  constexpr std::size_t SimdWidth = 8;

  #elif defined(USE_NEON)
  constexpr std::size_t SimdWidth = 16;
  #endif

[...]

#if defined (USE_AVX512)
    static constexpr const IndexType OutputSimdWidth = SimdWidth / 2;
#elif defined (USE_SSSE3)
    static constexpr const IndexType OutputSimdWidth = SimdWidth / 4;
#endif

@Sopel97 (Member, Author) commented Jun 13, 2021

a) More precisely, it's the number of registers that we're willing to use.
b) Not quite; it's wrong for AVX512.
b2) A weird artifact from the past, when the AVX512 code was really using only AVX2. Now we have to maintain it. Also, weight padding was only to 32 (and it is done in the serialized net, which is a terrible design decision, but whatever).
b3) We could. But honestly, all these SimdWidth constants are asking for a rewrite: what is "width"? Byte width? Lane width? For what lane type? It needs to provide more information and be more explicit about what it provides. It's also important to note that having access to AVX512 doesn't mean we actually use it everywhere, which complicates the semantics.

To elaborate on b3: the interface I'd like would encompass all the SIMD type traits. So a struct SimdTraits<SimdT> with things like SimdTraits<__m128 (or possibly some enum M128 instead, to avoid issues with passing these types and having i and f overloads)>::LaneCount<LaneT>, where LaneT is one of [u]int(8,16,32,64) or float/double, plus SimdTraits<SimdT>::NumRegisters. And have aliases like using LargestSimdType = /* arch dependent */ and SimdTraits<SimdT (the enum)>::Type<LaneT>.
The traits could even be always defined, setting NumRegisters to 0 when unavailable.

I can try to implement my vision in the following few days and we'll see how it looks

@snicolet closed this in ce4c523 Jun 13, 2021
@snicolet added the "to be merged" label (Will be merged shortly) Jun 13, 2021
@snicolet (Member) commented

OK, and good luck! Please keep your vision as simple and first-principles as possible :-)

Meanwhile I've merged the current documentation patch as ce4c523, thanks!

@snicolet (Member) commented Jun 13, 2021

I feel a little bit sorry to have merged a functional change for AVX512 without any speed data, but since nobody answered my question, well... :-)
