Compute optimal register count for feature transformer accumulation dynamically. #3543
architectures like 512x24x16 don't compile on master, or with this patch. Is this something that can be fixed in the same round? |
No. It would need completely different code for the affine transform. |
Anybody has speed data for this patch? And fishtest data? |
I believe this makes essentially no difference at run time, since the constants are computed at compile-time, and are the same except for avx512. For the latter architecture, one should measure a speedup, which might be worth verifying with a bench. I think it is better than hard-coding the constants as we do now. |
Indeed it might! :-) It is no small thing to add 48 lines to the Stockfish code just for the fun of it, because it feels better and might be a speed gain on some processor which is not mainstream at the moment. |
Bottom line is that the complexity and size of the NNUE code part are exploding. We should try to simplify it, not complexify it for the sake of 0.1% unproven speed-ups or algorithmic brilliancy satisfaction. |
Yes, I understand that argument. I think the value of the patch is removing the 'magic constants' for each processor, which are not so obvious; the added code works like documentation that explains how to obtain them. In fact, there was a mistake in the avx512 constants precisely because they are non-obvious. |
There are as many magic constants in the patch as in master, they are just hidden in the fourth parameter(!) of the templates for calculating "optimal register count".
Eagerly waiting for people trying to tune these constants with SPSA :-) |
that fourth parameter is a property of the architecture (the available registers), and while one needs to look it up, it is relatively clear what it is. The code transforms that hardware-specific number to what is needed based on the implementation details (basically the other arguments of the template). I'm not arguing it is pretty, I'm just saying that without this code, I (and most other developers) would probably not be able to guess/compute the right values for NumRegs and NumPsqrtRegs. |
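To make the transformation described above concrete, here is a minimal sketch of how a register count could be derived from the hardware limit. It is not the code from this PR: the name `best_register_count` and its parameters are hypothetical, and register and lane sizes are passed as plain byte counts so that the snippet compiles without any SIMD headers. It assumes the total accumulator size is a whole multiple of the register size.

```cpp
#include <cstddef>

// Hypothetical helper (not the PR's code): choose how many SIMD registers to
// use for one accumulation tile.
//   register_bytes : width of one SIMD register in bytes (16, 32 or 64)
//   lane_bytes     : size of one accumulator element in bytes
//   num_lanes      : number of elements to accumulate
//   max_registers  : registers the hardware offers, or that we are willing to use
// Precondition (assumed): num_lanes * lane_bytes is a multiple of register_bytes.
constexpr int best_register_count(std::size_t register_bytes,
                                  std::size_t lane_bytes,
                                  std::size_t num_lanes,
                                  int         max_registers) {
    // Number of registers needed to hold every lane at once.
    const std::size_t ideal = (num_lanes * lane_bytes) / register_bytes;
    if (ideal <= static_cast<std::size_t>(max_registers))
        return static_cast<int>(ideal);

    // Too many: fall back to the largest divisor of `ideal` that fits, so the
    // accumulator splits into equally sized tiles.
    for (int d = max_registers; d > 1; --d)
        if (ideal % static_cast<std::size_t>(d) == 0)
            return d;

    return 1;
}

// 512 int16 lanes in 64-byte registers need exactly 16 registers.
static_assert(best_register_count(64, 2, 512, 16) == 16, "64-byte registers");
// 384 int16 lanes in 32-byte registers would need 24; 12 is the largest divisor <= 16.
static_assert(best_register_count(32, 2, 384, 16) == 12, "32-byte registers");
```

The only hardware-specific input is `max_registers`; everything else follows from the layer size and element type, which is the point being made about the fourth template parameter.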
Reference: So x86-64 arch has 16 registers (well known), but one can only pass up to 4 parameters via registers without extra cost, which is not obvious. :) |
I suppose we are using the number of SIMD registers here, not the number of general purpose registers? |
Here is another version of the patch: https://github.com/snicolet/Stockfish/tree/optimal_register_count2 |
The biggest point of this patch is that it allows testing nets like 384x2-32-32-1 without having to change these magic constants. Previously people were deterred by the compiler errors and had to ask on discord what to do. I briefly tested it on AVX512 and found no measurable gain. |
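A rough illustration of why a fixed register count can reject a 384-wide net at compile time. This is a sketch under assumptions, not code from master: it assumes int16 accumulator elements, that the accumulator is processed in tiles of `num_regs` registers, that the layer width must be a whole number of tiles, and that the hard-coded count for 32-byte registers is 16. The function name `tiles_evenly` is hypothetical.

```cpp
// Hypothetical check: does a layer of `half_dims` int16 elements split into
// whole tiles of `num_regs` SIMD registers of `register_bytes` each?
constexpr bool tiles_evenly(int half_dims, int num_regs, int register_bytes) {
    const int lanes_per_register = register_bytes / 2;  // int16 = 2 bytes (assumed)
    const int tile_height        = num_regs * lanes_per_register;
    return half_dims % tile_height == 0;
}

// 32-byte registers with a fixed count of 16: tiles of 256 elements.
static_assert(tiles_evenly(512, 16, 32), "512 = 2 tiles of 256");
static_assert(!tiles_evenly(384, 16, 32), "384 is not a multiple of 256");

// A count of 12 gives tiles of 192 elements, which divides 384.
static_assert(tiles_evenly(384, 12, 32), "384 = 2 tiles of 192");
```

Under these assumptions, a derived count adapts to the layer width, whereas a fixed count only works for widths that happen to be multiples of the tile size.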
The point holds, we should derive those numbers from architecture definitions which are easier to understand. |
master:
patch:
|
This is what I tried to do in https://github.com/snicolet/Stockfish/tree/optimal_register_count2 , where each SIMD code path only sets the number of SIMD registers in the |
so in this PR it would be
which I'm not sure is better |
Just curious: why are the |
Note that AVX512 also uses different psqt_vec_t. On neon they just seem more type safe, the type indicates the lane types. I don't think it's possible to use an opaque type there. |
@Sopel97 Some questions: a) Am I correct in https://github.com/snicolet/Stockfish/tree/optimal_register_count2 to assume that the last parameter of BestRegisterCount() in b) I see in b1) Are they always correct, even for 64bits/32bits SSE2?
|
a) more precisely it's the number of registers that we're willing to use
to elaborate on b3, the interface I'd like to have would be something that encompasses all the simd type traits. So I can try to implement my vision in the following few days and we'll see how it looks |
OK and good luck, please keep your vision as simple and first principles as possible :-) Meanwhile I've merged the current documentation patch as ce4c523, thanks! |
I feel a little bit sorry to have merged a functional change for AVX512 without any speed data, but since nobody answered my question, well... :-) |
This also fixes a "bug" where AVX512 would only use 8 registers instead of 16 (now possible due to a 2x increase in FT size).
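The arithmetic behind the 8 versus 16 figure can be sketched as follows. The concrete widths are assumptions for illustration (int16 accumulator elements, and a feature transformer that grew from 256 to 512 elements per perspective); the names are hypothetical.

```cpp
constexpr int kRegisterBytes = 64;  // one 512-bit AVX512 register
constexpr int kLaneBytes     = 2;   // int16 accumulator element (assumed)

// Registers needed to hold a whole accumulator of `half_dims` elements.
constexpr int registers_for(int half_dims) {
    return half_dims * kLaneBytes / kRegisterBytes;
}

static_assert(registers_for(256) == 8, "old width (assumed) fills 8 registers");
static_assert(registers_for(512) == 16, "doubled width fills 16 registers");
```

With these numbers, a count of 8 was the most the smaller accumulator could use, while the doubled accumulator has enough data to fill 16 registers.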