Enhancement: Try to support BASEB == 64 #48
Some platforms provide this type but do 128-bit arithmetic with multiple instructions emitted by the compiler, so there's no performance advantage. I'm not sure how you can tell which is which.
Good idea @ilyakurdyukov .. thanks! We would need to run a test to determine if __int128 is supported, @ilyakurdyukov, like is done for the 64-bit code. While @pmetzger makes a good point, we would be surprised if __int128 would be a performance penalty over __int64. We will certainly look into this matter, as well as making it easier for someone to reduce or set the size to 64 or 32 bits. Thanks again!
There is an advantage on x86_64. And aarch64 (arm64) does it as an extra instruction, but it's faster anyway, because it would take four multiplications and a bunch of additions to simulate the same thing with half register sizes. (I assume that any 64-bit architecture should benefit from this; 64-bit architectures are easy to detect by pointer size.) I can say this for sure, because I had a project with multiplication of large numbers and made optimizations to make it faster, which also included trying to use SIMD instructions (no faster than using __int128), profiling (PGO), and inline assembly (good inline assembly can replace profiling).
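To illustrate the "four multiplications" point, here is a minimal hypothetical sketch (not calc code) comparing the __int128 one-liner against an equivalent built from 32-bit halves:

```c
#include <stdint.h>

/* high 64 bits of a 64x64 multiply, using __int128: one widening mul */
static uint64_t mulhi_int128(uint64_t a, uint64_t b) {
    return (uint64_t)(((unsigned __int128)a * b) >> 64);
}

/* the same result from half-size registers: four 32x32->64 multiplies
 * plus a pile of shifts and additions to propagate the carries */
static uint64_t mulhi_halves(uint64_t a, uint64_t b) {
    uint64_t a0 = (uint32_t)a, a1 = a >> 32;
    uint64_t b0 = (uint32_t)b, b1 = b >> 32;
    uint64_t p00 = a0 * b0, p01 = a0 * b1;
    uint64_t p10 = a1 * b0, p11 = a1 * b1;
    uint64_t mid = (p00 >> 32) + (uint32_t)p01 + (uint32_t)p10;
    return p11 + (mid >> 32) + (p01 >> 32) + (p10 >> 32);
}
```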
I measured the performance of bignums multiplication on examples from my project for plain C code on x86_64, aarch64, and elbrus-v5.
Also, can you list those platforms? Because both GCC and Clang don't support __int128 in 32-bit mode at all, and on 64-bit targets this compiles to a single widening multiply:

```c
unsigned long long test(unsigned long long a) {
    return (unsigned __int128)a * a >> 64;
}
```
We think that calc could test for this in longbits.c without harm to systems that lack 128-bit values. BTW: There is no int128_t because of "standards" reasons that are .. well: https://stackoverflow.com/questions/29638723/why-isnt-there-int128-t sigh
Tests on godbolt.org show that __int128 is supported in GCC >= 4.6.4; GCC <= 4.5.3 does not support it. Those are quite ancient versions of compilers, so I think we can expect __int128 today on any 64-bit target.

It also seems that recent versions of GCC and Clang (maybe all versions do, but I didn't check) define the __SIZEOF_INT128__ macro if __int128 is supported. They do not define this macro in 32-bit mode, where __int128 is not available. Hope this helps.
Is there a gcc and/or clang way to use __int128 and unsigned __int128 constants, as in something like:

```c
__int128 a = 0x123456789abcdef0123456789abcdef0LLL; /* ??? */
```

Is there a gcc and/or clang way to printf __int128 and unsigned __int128 values, as in something like:

```c
printf("%lllx\n", a); /* ??? */
```
It seems that this works in semi-recent versions of clang and gcc:

```c
#if defined(__SIZEOF_INT128__)
printf("yes we have __int128\n");
#else
printf("no we do not have __int128\n");
#endif
```
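For reference, a complete runnable version of that check (a minimal sketch with the needed boilerplate added):

```c
#include <stdio.h>

int main(void) {
#if defined(__SIZEOF_INT128__)
    printf("yes we have __int128\n");
#else
    printf("no we do not have __int128\n");
#endif
    return 0;
}
```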
I think there is no way to do this (at least not in a standardized way), so I only use this type when reading/writing through that type's pointer. But let me think:

```c
#include <stdio.h>

int main() {
    __int128 a = (__int128)0x123456789abcdef << 64 | 0x0123456789abcdefU;
    printf("a = %016llx%016llx\n",
           (unsigned long long)(a >> 64), (unsigned long long)a);
}
```
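One workaround for the missing 128-bit constant syntax is a helper macro that assembles the value from two 64-bit halves (MK_INT128 is a hypothetical name, not a compiler feature):

```c
/* hypothetical helper: build a 128-bit constant from two 64-bit halves */
#define MK_INT128(hi, lo) \
    (((unsigned __int128)(unsigned long long)(hi) << 64) | \
     (unsigned long long)(lo))

unsigned __int128 a = MK_INT128(0x123456789abcdef0ULL, 0x123456789abcdef0ULL);
```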
This type should only be used in certain places; for those purposes, this data type works well and speeds up the bignums code.
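For illustration, here is a hypothetical sketch (not calc's actual code) of the kind of inner loop that benefits, where each step is a single __int128 multiply:

```c
/* multiply the n-digit number num[] by digit, add into res[], and
 * return the final carry; the __int128 product and carry replace the
 * four half-width multiplies this would otherwise need per digit */
static unsigned long long
mul_add_digit(unsigned long long *res, const unsigned long long *num,
              int n, unsigned long long digit) {
    unsigned __int128 carry = 0;
    for (int i = 0; i < n; i++) {
        carry += (unsigned __int128)num[i] * digit + res[i];
        res[i] = (unsigned long long)carry;
        carry >>= 64;
    }
    return (unsigned long long)carry;
}
```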
The u128/u64 integer division seems inefficient: even though x86_64 has such an instruction, both GCC and Clang issue a call to the __udivti3() library function for this:

```c
unsigned long long test_divide(unsigned long long *x) {
    unsigned __int128 a = (unsigned __int128)x[1] << 64 | x[0];
    return a / x[2];
}
```

But multiplication is well handled by both compilers.
I think replacing HALF/FULL for all of the code would be too invasive. But I can just write an alternative version of some functions.
Here are some of our initial thoughts .. we need to think about this some more. The status of 128-bit integers appears to be complicated.
Our guess is that one would NOT want to compile in use of 128-bit values by default. One might be able to adjust longbits.c / longbits.h (as you suggested above) so that it could produce:

```c
#define LONG_BITS 128
#define HAVE_B128
```

But the U(x) and L(x) macros would need to remain forming just 64-bit constants. Then zmath.h would need to be adjusted so that if HAVE_B128 AND TRY_128BIT were defined, it would use the 128-bit code.

We would not want the main code to turn into lots of ifdef stuff just to experiment with 128-bit code. And that is just the start! We would be happy to put out the initial compiling framework for such a non-default mode. If you were willing to fork that and experiment, it would help move this forward.

Let us think about this some more ...
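A minimal sketch of what that zmath.h adjustment might look like (the HALF/FULL type names and the HAVE_B128/TRY_128BIT guards come from the discussion above; the exact definitions here are an assumption):

```c
#if defined(HAVE_B128) && defined(TRY_128BIT)
typedef unsigned long long HALF;   /* BASEB == 64 digits        */
typedef unsigned __int128 FULL;    /* double-width intermediate */
#define BASEB 64
#else
typedef unsigned int HALF;         /* BASEB == 32 digits        */
typedef unsigned long long FULL;   /* double-width intermediate */
#define BASEB 32
#endif
```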
Same as 64-bit FULL would be on 32-bit architectures for some operations.

I don't think LONG_BITS = 128 will pass the tests, because there are functions that use the generic long type, which won't get bigger if I change the LONG_BITS:

```c
E_FUNC void zmuli(ZVALUE z, long n, ZVALUE *res);
```

And there are ifdefs everywhere (#if BASEB == 32, #if BASEB == 16), so I don't think it's possible for me to improve that other than to do an ifdef around some functions and only improve those parts. Because I won't be able to correct all the code for the new size. I can understand these parts, but not all of the code, and I would definitely break something trying to change types everywhere (without having any idea how to fix it).
A fundamental requirement to support multi-precision is that one must support these 4 BASEB operations: addition, subtraction, multiplication (forming a double-BASEB product), and division (of a double-BASEB value by a BASEB value). Along with these come the "I/O" integer needs, such as forming constants and printing values.

Can compilers manage generating machine code for double-precision 128-bit operations, i.e., produce 256-bit products and perform 256-bit division? If the answer is no, then a BASEB of 64 might be attempted, while a BASEB of 128 would almost certainly fail right now with 128-bit integer ops.

We guess that the best one can do with 128-bit integers right now might be to support a BASEB of 64. That would be better than the default BASEB of 32 we have now, assuming that the 4 fundamental operations described above are computationally efficient: otherwise your overall calculations would run slower. Simply being able to perform the 4 fundamental ops with a BASEB of 64 is not sufficient. One needs to be faster at BASEB == 64 than doing multi-precision with a BASEB of 32. You indicated that division with 128-bit integers might be slow. That is also why, if BASEB of 64 is attempted, one might not yet want it to be the compiled default.

We warn you that the operations going on in the z*.c functions are not for the "faint of heart". Many years of very detailed work went into crafting that code. It is a solid code base today, with thousands of hours of testing and regression code to help keep it that way. A half-way BASEB work (say supporting add and multiply, but not division) would NOT cut it. You really do need all 4 fundamental operations to work, to work exactly, and to be more efficient for your BASEB for all this to work.

Our advice is to focus just on BASEB of 64 for now. Look at the zmath.h macros needed to support a BASEB of 64 and let the rest of the z*.c code plus the C compiler do the rest of the work.

You indicated that there isn't direct support for forming 128-bit constants. While annoying, that shouldn't stop one from attempting to try BASEB of 64. In zmath.h one could come up with the macros U(x) and L(x) needed to form integer constants for BASEB of 64. There may be some work needed in zrandom.c with forming the required constants for the various generators. But that would simply be "get in and grind out the required arrays" work to support BASEB of 64. Doable work.

Just some more thoughts ... Let us ponder this some more.
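To make the requirement concrete, here is a hypothetical sketch (not calc's actual zmath.h) of the two harder fundamental operations at BASEB == 64, leaning on the compiler's __int128 support:

```c
typedef unsigned long long HALF;   /* one BASEB-bit digit  */
typedef unsigned __int128 FULL;    /* two BASEB-bit digits */

/* HALF x HALF -> double-width product, split into hi/lo digits */
static void mul_digits(HALF a, HALF b, HALF *hi, HALF *lo) {
    FULL p = (FULL)a * b;
    *hi = (HALF)(p >> 64);
    *lo = (HALF)p;
}

/* (hi,lo) / d -> quotient digit, with remainder; requires hi < d so
 * that the quotient fits in a single HALF */
static HALF div_digits(HALF hi, HALF lo, HALF d, HALF *rem) {
    FULL n = ((FULL)hi << 64) | lo;
    *rem = (HALF)(n % d);
    return (HALF)(n / d);
}
```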
And it can, but it's done through a library function call, which can be solved through the use of inline assembly. But don't say that I present an entirely new problem, because this problem has always been here. In x86 32-bit code it's exactly the same: if you try to divide uint64_t by uint32_t, you'll get a call to __udivdi3() in the assembly, even if it can be done with a single instruction.
I suggest using BASEB=64 for 64-bit architectures, where FULL is __int128. I'm not talking about BASEB=128; that makes no sense, unless there were architectures with 128-bit arithmetic.
Even calling a function that should be a wrapper for a division instruction, but with extra sanity checks, should be faster than two division instructions (and it actually requires four, just like multiplication). And as I said before, this is not a new problem: you already have this problem on 32-bit architectures, but you didn't know about it until now. I wonder if there is a built-in function to use this instruction directly without the extra call and sanity checks (it can be solved with inline assembly, but it's better to use a built-in). So you need to think about a solution for this anyway (or just use the function call as before if BASEB matches the architecture bits).
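For the x86_64 case, a hedged sketch of what such an inline-assembly wrapper could look like (GCC/Clang extended asm; the caller must guarantee hi < d, otherwise divq faults):

```c
/* divide the 128-bit value (hi:lo) by d with a single divq; returns
 * the quotient and stores the remainder via *rem; requires hi < d */
static inline unsigned long long
div_u128_u64(unsigned long long hi, unsigned long long lo,
             unsigned long long d, unsigned long long *rem) {
    unsigned long long q, r;
    __asm__("divq %4"
            : "=a"(q), "=d"(r)
            : "a"(lo), "d"(hi), "r"(d));
    *rem = r;
    return q;
}
```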
There's a big misunderstanding here if you thought I wanted BASEB=128; it was BASEB=64 that I wanted from the beginning.
Is there a lot of code that uses these macros?
This stopped me from making these changes by myself, because I don't know this code well enough.
We can look into a BASEB of 64 to see if this is realistic. If it is realistic, then it needs to be optional instead of a default. This will take a while. |
Hello @ilyakurdyukov, We believe we have an approach that would allow your requested enhancement. Please see pull #98, as this has a direct bearing on being able to support BASEB == 64.
We plan to implement BASEB of 64 in calc version 3. |
This feature will be considered in issue #103 when calc v3 is released |
As this issue will be folded into issue #103, we are closing this issue and moving future discussion over there. |
This is a synthetic double-register-size type; many architectures can compute twice as many bits from a multiplication, e.g. u64 * u64 = u128. `long long` has the same meaning on 32-bit architectures as `__int128` has on 64-bit ones.

You might not like it, since it's a non-standard type (`-pedantic` prevents it from being used), but it will give twice faster addition/subtraction and 4x faster multiplication.

The calc code seems to be very tied to the HALF/FULL type definitions, but __int128 should only be used in certain places in `zmath.c`.

You can detect its availability by adding a test to `longbits.c` and using a CHECK_B128 flag in the Makefile, so that if compilation with CHECK_B128 fails, it will build without it.
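A minimal sketch of what such a probe in `longbits.c` could look like (the CHECK_B128 name comes from the issue text; everything else is an assumption, following the way longbits.c emits #define lines):

```c
#include <stdio.h>

int main(void) {
#if defined(CHECK_B128)
    /* this arithmetic only compiles if the compiler has real __int128 */
    unsigned __int128 p =
        (unsigned __int128)0xfedcba9876543210ULL * 0x0123456789abcdefULL;
    printf("#define HAVE_B128 /* __int128 works: hi=%llx */\n",
           (unsigned long long)(p >> 64));
#else
    printf("#undef HAVE_B128 /* no 128 bit support assumed */\n");
#endif
    return 0;
}
```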