
Enhancement: Try to support BASEB == 64 #48

Closed
ilyakurdyukov opened this issue Jan 27, 2022 · 24 comments

@ilyakurdyukov

__int128 is a synthetic double-register-size type; many architectures can compute twice as many bits from a multiplication, e.g. u64 * u64 = u128.

long long has the same meaning on 32-bit architectures as __int128 has on 64-bit ones.

You might not like it, since it's a non-standard type (-pedantic prevents it from being used), but it gives roughly 2x faster addition/subtraction and 4x faster multiplication.

The calc code seems to be very tied to the HALF/FULL type definitions, but __int128 would only need to be used in certain places in zmath.c.

You can detect its availability by adding this to longbits.c:

#ifdef CHECK_B128
		if (sizeof(__int128) == 16) {
			printf("#undef HAVE_B128\n");
			printf("#define HAVE_B128\t\t/%s/\n",
			  "* have USB128 and SB128 types *");
			printf("typedef unsigned __int128 USB128;\t/%s/\n",
			  "* unsigned 128 bits *");
			printf("typedef __int128 SB128;\t\t/%s/\n",
			  "* signed 128 bits *");
			putchar('\n');
		}
#endif

And using this in the Makefile:

	${Q} ${LCC} ${ICFLAGS} -DCHECK_B128 longbits.c -c ${S} || ${LCC} ${ICFLAGS} longbits.c -c ${S}

So if compilation with CHECK_B128 fails, it will build without it.
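
If the probe compiles, the generated longbits.h would then gain lines like:

#undef HAVE_B128
#define HAVE_B128		/* have USB128 and SB128 types */
typedef unsigned __int128 USB128;	/* unsigned 128 bits */
typedef __int128 SB128;		/* signed 128 bits */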

@pmetzger

Some platforms provide this type but do 128-bit arithmetic with multiple instructions emitted by the compiler, so there's no performance advantage. I'm not sure how you can tell which is which.

@lcn2
Owner

lcn2 commented Jan 28, 2022

Good idea @ilyakurdyukov .. thanks!

We would need to run a test to determine if __int128 is supported, @ilyakurdyukov, just as is done for the 64-bit code.

While @pmetzger makes a good point, we would be surprised if __int128 incurred a performance penalty over __int64.
And for those (rare?) cases where __int128 is sub-optimal, one could always use Makefile.local to force the use of __int64.

We will certainly look into this matter as well as making it easier for someone to reduce or set the size to 64 or 32 bits.

Thanks again!

@ilyakurdyukov
Author

ilyakurdyukov commented Jan 28, 2022

@pmetzger

Some platforms provide this type but do 128-bit arithmetic with multiple instructions emitted by the compiler, so there's no performance advantage. I'm not sure how you can tell which is which.

There is an advantage on x86_64. And aarch64 (arm64) does it with an extra instruction, but it's faster anyway, because it would take four multiplications and a bunch of additions to simulate the same thing with half-size registers. (I assume that any 64-bit architecture should benefit from this; 64-bit architectures are easy to detect by pointer size or the __LP64__ macro.)

I can say this for sure because I had a project doing multiplication of large numbers, and I optimized it to make it faster. That included trying SIMD instructions (no faster than using __int128), profiling (PGO), and inline assembly (good inline assembly can replace profiling).

@ilyakurdyukov
Author

ilyakurdyukov commented Jan 28, 2022

The performance of bignum multiplication on examples from my project:

x86_64 plain C code (u32*u32->u64): 3.388s
x86_64 SSE2 (u32*u32->u64): 2.431s
x86_64 AVX2 (u32*u32->u64): 1.357s
x86_64 plain C code (u64*u64->u128): 0.775s
x86_64 assembly inline (u64*u64->u128): 0.550s

  • SSE2/AVX2 is worth doing for x86 (32-bit); these SIMD instruction sets have no 64-bit multiply.

  • 3.388 / 0.775 = 4.37 times faster!

aarch64 plain C code (u32*u32->u64): 21.269s
aarch64 plain C code (u64*u64->u128): 9.911s

  • It's a cheap little devboard, which is why it's so much slower than the x86_64 results.

  • aarch64 uses an extra umulh instruction to get the high part of the result, which explains why it's only twice as fast.

elbrus-v5 plain C code (u32*u32->u64): 18.492s
elbrus-v5 SIMD (u32*u32->u64): 10.652s
elbrus-v5 plain C code (u64*u64->u128): 5.093s

  • This architecture also uses an extra instruction to get the high part of the result (same as aarch64).

@ilyakurdyukov
Author

ilyakurdyukov commented Jan 28, 2022

Some platforms provide this type but do 128-bit arithmetic with multiple instructions emitted by the compiler, so there's no performance advantage. I'm not sure how you can tell which is which.

Also, can you list those platforms? I ask because both GCC and Clang don't support __int128 when compiling test code for a 32-bit target (-m32, tested for x86 and armv7-a). So I assume that if __int128 is available, then it's worth using it.

long long test(unsigned long long a) {
    return (unsigned __int128)a * a >> 64;
}

@lcn2
Owner

lcn2 commented Jan 28, 2022

We think that calc could test for this in longbits.c without harm to systems that lack 128 bit values.

BTW: There is no int128_t because of "standards" reasons that are .. well:

https://stackoverflow.com/questions/29638723/why-isnt-there-int128-t

sigh

@ilyakurdyukov
Author

Tests on godbolt.org show that __int128 is supported in GCC >= 4.6.4 (GCC <= 4.5.3 does not support it) and in Clang >= 3.1 (Clang <= 3.0 does not).

Quite ancient versions of compilers.

I think we can expect __int128 today if:

#if defined(__GNUC__) && defined(__LP64__)
// ...
#endif

It also seems that recent versions of GCC and Clang (maybe all versions that support __int128, but I didn't check) define this macro when __int128 is supported:

#define __SIZEOF_INT128__ 16

They do not define this macro in 32-bit mode, where __int128 is not available.

Hope this helps.

@lcn2
Owner

lcn2 commented Jan 28, 2022

Is there a gcc and/or clang way to use __int128 and unsigned __int128 constants, as in something like:

__int128 a = 0x123456789abcdef0123456789abcdef0LLL; /* ??? */

Is there a gcc and/or clang way to printf __int128 and unsigned __int128 values, as in something like:

printf("%lllx\n", a); /* ??? */

@lcn2
Owner

lcn2 commented Jan 28, 2022

It also seems that recent versions of GCC and Clang (maybe all versions that support __int128, but I didn't check) define this macro when __int128 is supported:

It seems that it works in semi-recent versions of clang and gcc:

#if defined(__SIZEOF_INT128__)
    printf("yes we have __int128\n";
#else
    printf("no we do not have __int128\n");
#endif

@ilyakurdyukov
Author

Is there a gcc and/or clang way to use __int128 and unsigned __int128 constants, as in something like:

__int128 a = 0x123456789abcdef0123456789abcdef0LLL; /* ??? */

Is there a gcc and/or clang way to printf __int128 and unsigned __int128 values, as in something like:

printf("%lllx\n", a); /* ??? */

I think there is no way to do this (at least not in a standardized way), so this type is best used only internally, reading/writing values through a pointer of that type.

But let me think:

#include <stdio.h>

int main() {
	__int128 a = (__int128)0x123456789abcdef << 64 | 0x0123456789abcdefU;
	printf("a = %016llx%016llx\n", (long long)(a >> 64), (long long)(a));
}
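
For base 10 there is no such trick: the digits have to be produced manually. A minimal sketch (illustrative only, not calc code):

#include <stdio.h>

/* illustrative sketch: print an unsigned __int128 in base 10,
   since printf() has no conversion specifier for 128-bit integers */
void print_u128(unsigned __int128 v) {
	char buf[40];	/* 2^128 - 1 has 39 decimal digits */
	int i = sizeof(buf);
	buf[--i] = '\0';
	do {
		buf[--i] = '0' + (int)(v % 10);
		v /= 10;
	} while (v != 0);
	fputs(buf + i, stdout);
}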

@ilyakurdyukov
Author

This type should be used to:

  1. get the carry after adding/subtracting 64-bit integers;
  2. get the high part of the product of two 64-bit integers.

For these purposes, this data type works well and speeds up the bignum code.
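
A minimal sketch of both uses (illustrative only, assuming a GCC/Clang target where unsigned __int128 is available):

#include <stdint.h>

/* 1. 64-bit add that also yields the carry bit */
static uint64_t add64_carry(uint64_t a, uint64_t b, uint64_t *carry) {
	unsigned __int128 t = (unsigned __int128)a + b;
	*carry = (uint64_t)(t >> 64);	/* 0 or 1 */
	return (uint64_t)t;
}

/* 2. 64x64 -> 128 multiply: returns the low half, stores the high half */
static uint64_t mul64_wide(uint64_t a, uint64_t b, uint64_t *hi) {
	unsigned __int128 t = (unsigned __int128)a * b;
	*hi = (uint64_t)(t >> 64);
	return (uint64_t)t;
}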

@ilyakurdyukov
Author

The u128/u64 integer division seems inefficient: even though x86_64 has such an instruction, both GCC and Clang emit a call to the __udivti3 library function (presumably because the divide instruction faults when the quotient doesn't fit in 64 bits, so the compiler can't use it unconditionally).

unsigned long long test_divide(unsigned long long *x) {
    unsigned __int128 a = (unsigned __int128)x[1] << 64 | x[0];
    return a / x[2];
}

But multiplication is well handled by both compilers.

@ilyakurdyukov
Author

I think replacing HALF/FULL throughout the calc sources would require a lot of editing, and it may reduce the performance of some parts.

But I can just write an alternative version of zmuli() from zmath.c for 64-bit architectures with __int128 support. Are there other multiplication functions?
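
The inner loop of such a version might look roughly like this (a hypothetical sketch of the technique only; the real zmuli() operates on ZVALUE and HALF types):

#include <stdint.h>
#include <stddef.h>

/* hypothetical sketch: multiply an array of 64-bit limbs by a 64-bit
   value, propagating the carry through unsigned __int128 */
static uint64_t mul_limbs_64(uint64_t *dst, const uint64_t *src,
			     size_t n, uint64_t m) {
	uint64_t carry = 0;
	for (size_t i = 0; i < n; i++) {
		unsigned __int128 t = (unsigned __int128)src[i] * m + carry;
		dst[i] = (uint64_t)t;
		carry = (uint64_t)(t >> 64);
	}
	return carry;	/* nonzero if the result needs one more limb */
}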

@lcn2
Owner

lcn2 commented Jan 29, 2022

Here are some of our initial thoughts .. we need to think about this some more.

So the status of 128-bit integers appears to be complicated:

  • No 128-bit constants
  • No 128-bit % formatting
  • Some arithmetic operations may be sub-optimal in certain cases on certain architectures

Our guess is that one would NOT want to compile in use of 128-bit values by default.
Perhaps a user would have to add some form of -DTRY_128BIT in Makefile.local?

One might be able to extend longbits.c / longbits.h (as you suggested above) so that it could produce:

#define LONG_BITS 128

#define HAVE_B128
typedef unsigned __int128 USB128;
typedef __int128 SB128;

But the U(x) and L(x) macros would need to remain forming just 64-bit constants.

Then zmath.h would need to be adjusted so that if HAVE_B128 AND TRY_128BIT were defined, it would
define BASEB to 64 and add in the proper swap macros, plus other stuff related to BASEB in zmath.h.
There are likely other issues there.

We would not want the main code to turn into lots of ifdef stuff just to experiment with 128-bit code.
However, if the pain could be mostly confined to header files, then we might be willing to let people
try it out as a non-default mode.

And that is just the start!

We would be happy to put out the initial compiling framework for such a non-default mode. If you were willing to fork that
and test it out to see if you can get calc to pass the regression check (make chk or the more verbose make check),
you would be welcome to experiment on your fork.

Let us think about this some more ...

@ilyakurdyukov
Author

ilyakurdyukov commented Jan 29, 2022

Some arithmetic operations may be sub-optimal in certain cases on certain architectures

Same as 64-bit FULL would be on 32-bit architectures for some operations.

I don't think LONG_BITS = 128 will pass the tests, because there are functions that use the generic long type, which won't get bigger if I change LONG_BITS.

E_FUNC void zmuli(ZVALUE z, long n, ZVALUE *res);
E_FUNC long zdivi(ZVALUE z, long n, ZVALUE *res);
E_FUNC long zmodi(ZVALUE z, long n);

And there are ifdefs everywhere ("#if BASEB == 32", "#if BASEB == 16"), so I don't think I can do better than to put an ifdef around some functions and improve only those parts. I won't be able to adapt all the code to the new size: I can understand these parts, but not all of the code, and I would certainly break something by trying to change types everywhere (without having any idea how to fix it).

@lcn2
Owner

lcn2 commented Jan 29, 2022

A fundamental requirement to support multi-precision is that one must support these 4 BASEB operations:

  • Add two BASEB and get a BASEB sum with a carry bit
  • Subtract BASEB from BASEB and get a BASEB difference with a borrow bit
  • Multiply two BASEB and get a double BASEB product
  • Divide a double BASEB by a BASEB and get a BASEB quotient and a BASEB remainder

Along with these "I/O" integer needs:

  • Form an integer constant of BASEB size
  • Print a BASEB-sized integer in base 10 and base 16

Can compilers manage generating machine code for double-precision 128-bit operations: i.e., produce 256-bit products and perform 256-bit division? If the answer is no, then a BASEB of 64 might be attempted, while a BASEB of 128 would almost certainly fail right now with 128-bit integer ops.

We guess that the best one can do with 128-bit integers right now might be to support a BASEB of 64. That would be better than the default BASEB of 32 we have now, assuming that the 4 fundamental operations described above are computationally efficient; otherwise your overall calculations would run slower.

Simply being able to perform the 4 fundamental ops with a BASEB of 64 is not sufficient. One needs to be faster at BASEB == 64 than doing multi-precision with BASEB of 32.

You indicated that division with 128-bit integers might be slow. That is also why if BASEB of 64 is attempted, one might not yet want that to be the compiled default.

We warn you that the operations going on inside the z*.c functions are not for the "faint of heart". Many years of very detailed work went into crafting that code. It is a solid code base today, with thousands of hours of testing and regression code to help keep it that way.

A half-way BASEB implementation (say, supporting add and multiply, but not division) would NOT cut it. You really do need all 4 fundamental operations to work, to work exactly, and to be more efficient for your BASEB for all this to pay off.

Our advice is to focus just on BASEB of 64 for now. Look at the zmath.h macros needed to support a BASEB of 64 and let the rest of the z*.c code plus the C compiler do the rest of the work.

You indicated that there isn't direct support for forming 128-bit constants. While annoying, that shouldn't stop one from attempting to try BASEB of 64. In zmath.h one could come up with the macros U(x) and L(x) needed to form integer constants for BASEB of 64.
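
Hypothetical stand-ins for such macros might look like this (illustrative only; the real U(x) and L(x) in zmath.h may differ):

/* hypothetical sketch: form a 128-bit FULL constant from 64-bit halves */
#define U128(x) ((unsigned __int128)(x) << 64)	/* high 64 bits */
#define L128(x) ((unsigned __int128)(x))	/* low 64 bits */

/* example: U128(0x123456789abcdef0ULL) | L128(0x0fedcba987654321ULL) */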

There may be some work to be done in zrandom.c forming the required constants for the various generators. But that would simply be "get in and grind out the required arrays" work to support BASEB of 64. Doable work.

Just some more thoughts ... Let us ponder this some more.

@ilyakurdyukov
Author

Divide a double BASEB by a BASEB and get a BASEB quotient and a BASEB remainder

And it can, but it's done through a library function call, which could be avoided through the use of inline assembly.
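
For example, on x86_64 the divq instruction could be wrapped directly (a hypothetical sketch; the caller must guarantee hi < d, otherwise the instruction faults):

/* hypothetical sketch: 128/64 -> 64 division via x86_64 divq,
   avoiding the __udivti3 library call; requires hi < d */
static unsigned long long div128_64(unsigned long long hi,
				    unsigned long long lo,
				    unsigned long long d,
				    unsigned long long *rem) {
	unsigned long long q, r;
	__asm__("divq %4" : "=a"(q), "=d"(r) : "a"(lo), "d"(hi), "rm"(d));
	*rem = r;
	return q;
}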

But don't say that I'm presenting an entirely new problem, because this problem has always been here. In x86 32-bit code it's exactly the same: if you try to divide uint64_t by uint32_t, you'll get a call to __udivdi3() in the assembly, even though it can be done with a single instruction.

Can compilers manage generating machine code for double-precision 128-bit operations: i.e., produce 256-bit products and perform 256-bit division? If the answer is no, then a BASEB of 64 might be attempted, while a BASEB of 128 would almost certainly fail right now with 128-bit integer ops.

I suggest using BASEB=64 for 64-bit architectures, where FULL is unsigned __int128 and HALF is uint64_t, because BASEB is currently only 32. So there's a performance penalty, because 64-bit hardware can do it several times faster.

I'm not talking about BASEB=128; that makes no sense unless there were architectures with native 128-bit arithmetic.

@ilyakurdyukov
Author

You indicated that division with 128-bit integers might be slow. That is also why if BASEB of 64 is attempted, one might not yet want that to be the compiled default.

Even calling a function that is just a wrapper for a division instruction, with extra sanity checks, should be faster than two division instructions (and in fact it requires four, just like multiplication). And as I said before, this is not a new problem: you already have it on 32-bit architectures, you just didn't know about it until now.

I wonder if there is a built-in function to use this instruction directly, without the extra call and sanity checks (this can be solved with inline assembly, but a built-in would be better).

So you need to think about a solution for this anyway (or just use the function call, as before, if BASEB matches the architecture's bit width).

Our advice is to focus just on BASEB of 64 for now.

There's a big misunderstanding here if you thought I wanted BASEB=128; it was BASEB=64 that I wanted from the beginning.

@ilyakurdyukov
Author

You indicated that there isn't direct support for forming 128-bit constants. While annoying, that shouldn't stop one from attempting to try BASEB of 64. In zmath.h one could come up with the macros U(x) and L(x) needed to form integer constants for BASEB of 64.

Is there a lot of code that uses these macros?

While annoying, that shouldn't stop one from attempting to try BASEB of 64.

This is what stopped me from making these changes myself: I don't know this code well enough.

@lcn2 changed the title from "Use the __int128 type for better performance" to "Enhancement: Try to support BASEB == 64" on Jan 30, 2022
@lcn2
Owner

lcn2 commented Jan 30, 2022

We can look into a BASEB of 64 to see if this is realistic. If it is realistic, then it needs to be optional instead of a default.

This will take a while.

@lcn2 self-assigned this on Mar 6, 2023
@lcn2
Owner

lcn2 commented Aug 21, 2023

Hello @ilyakurdyukov,

We believe we have an approach that would allow your requested enhancement.

Please see pull request #98, as it has a direct bearing on being able to support BASEB == 64.

@lcn2
Owner

lcn2 commented Sep 6, 2023

We plan to implement BASEB of 64 in calc version 3.

@lcn2
Owner

lcn2 commented Oct 4, 2023

This feature will be considered in issue #103 when calc v3 is released.

@lcn2
Owner

lcn2 commented Oct 4, 2023

As this issue will be folded into issue #103, we are closing this issue and moving future discussion over there.

@lcn2 closed this as completed on Oct 4, 2023